COMPARING PERFORMANCE OF DIFFERENT MODELS

1. INTRODUCTION

The aim of this homework is to compare the performance of penalized regression approaches, decision trees, and tree-based ensembles. Four models (lasso regression, a CART decision tree, random forest, and stochastic gradient boosting) will be trained on four datasets, and their performance will be compared based on test errors.

The rpart, glmnet, gbm, rattle, caret, and randomForest packages are used for modeling in this assignment.

library(knitr)
library(skimr)
library(dplyr)
library(rpart)
library(rattle)
library(data.table)
library(tidyverse)
library(glmnet)
library(caret)
library(randomForest)
library(gbm)
library(e1071)
library(plyr)
library(smooth)
library(pROC)
library(plm)

2. DATA

Brief information about the chosen datasets and the related links can be found below.

airbnb_data <- fread("~/Desktop/IE582_HW4/airbnb.csv")
forest_data <- fread("~/Desktop/IE582_HW4/forest.csv")
news_data <- fread("~/Desktop/IE582_HW4/news.csv")
HR_data <- fread("~/Desktop/IE582_HW4/HR.csv")

Airbnb Dataset

This dataset contains Airbnb listings in London, UK. There are categorical variables such as property type, room type, and location; binary variables such as availability; and numerical variables such as deposit and cleaning fee.

Task: Regression
Objective: Predict listing price
Variables: 42

The data can be downloaded from this link

Forest Dataset

This dataset contains tree observations from four areas of the Roosevelt National Forest in Colorado. It includes information on tree type, shadow coverage, distance to nearby landmarks, soil type, local topography, and more. The wilderness area and soil type attributes are binary indicators, and the target, forest cover type, is categorical with 7 classes labeled 1 to 7. The attribute names, types, measurement units, and brief descriptions are given with the data. Predicting the forest cover type is a multi-class classification problem.

Task: Multi-class classification
Objective: Predict forest type
Variables: 55

The data can be downloaded from this link

HR Dataset

This dataset includes factors that lead to employee turnover. It has binary features such as gender and marital status, ordinal features such as education level, job involvement, and job level, and numerical features such as monthly income and age.

Task: Binary classification with a 5:1 class imbalance
Objective: Predict employee turnover
Variables: 35

The data can be downloaded from this link

News Dataset

This dataset summarizes a set of features about online news articles published by a media platform. The goal is to predict the number of shares on social networks. The dataset has categorical and binary features, such as “is the weekday Monday?”, as well as numerical features such as the number of words.

Task: Regression
Objective: Predict popularity
Variables: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

The data and explanation of variable names can be found in this link

3. AIRBNB DATASET

3.1 Data Manipulation

Prices are continuous, but according to the histogram the distribution does not look normal.

hist(airbnb_data$price_x, breaks=2000, xlim=c(0,500))

I filtered out listings priced at 500 or above, removed descriptive variables and variables with no variance, and converted numerical variables with fewer than 10 unique values to categorical.

colnames(airbnb_data) <- make.names(colnames(airbnb_data))

airbnb_data <- airbnb_data   %>% 
  filter(price_x<500)


airbnb_data_drop <- select(airbnb_data, -c("V1","id","host_location","host_since", "street","security_deposit","cleaning_fee","city", "maximum_nights","is_business_travel_ready","requires_license","cancellation_policy","property_type","bed_type"))

airbnb_data_omit <- na.omit(airbnb_data_drop)

airbnb_data_factors <- airbnb_data_omit %>% 
  select_if(~n_distinct(.) <= 10)


airbnb_data_fix <- airbnb_data_omit %>% 
  mutate_at(c("experiences_offered" , "host_is_superhost","host_has_profile_pic", "host_identity_verified", "is_location_exact","room_type_y" , "instant_bookable", "require_guest_profile_picture", "require_guest_phone_verification" ), factor)



skim(airbnb_data_fix)
Data summary
Name airbnb_data_fix
Number of rows 80234
Number of columns 28
_______________________
Column type frequency:
factor 9
numeric 19
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
experiences_offered 0 1 FALSE 5 non: 78712, bus: 515, fam: 443, soc: 366
host_is_superhost 0 1 FALSE 2 f: 67933, t: 12301
host_has_profile_pic 0 1 FALSE 2 t: 79937, f: 297
host_identity_verified 0 1 FALSE 2 f: 52044, t: 28190
is_location_exact 0 1 FALSE 2 t: 54983, f: 25251
room_type_y 0 1 FALSE 4 Ent: 44639, Pri: 34357, Hot: 654, Sha: 584
instant_bookable 0 1 FALSE 2 f: 46749, t: 33485
require_guest_profile_picture 0 1 FALSE 2 f: 79460, t: 774
require_guest_phone_verification 0 1 FALSE 2 f: 78951, t: 1283

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
price_x 0 1 102.23 79.11 0 45 80 130.0 499 ▇▃▁▁▁
host_listings_count 0 1 19.51 109.40 0 1 1 4.0 1687 ▇▁▁▁▁
accommodates 0 1 3.12 1.92 1 2 2 4.0 16 ▇▂▁▁▁
bathrooms 0 1 1.29 0.58 0 1 1 1.5 35 ▇▁▁▁▁
bedrooms 0 1 1.38 0.88 0 1 1 2.0 50 ▇▁▁▁▁
beds 0 1 1.70 1.22 0 1 1 2.0 50 ▇▁▁▁▁
guests_included 0 1 1.57 1.25 1 1 1 2.0 46 ▇▁▁▁▁
extra_people 0 1 7.43 14.01 0 0 0 10.0 247 ▇▁▁▁▁
minimum_nights_y 0 1 4.60 19.28 1 1 2 3.0 1125 ▇▁▁▁▁
availability_30 0 1 4.84 5.50 0 0 0 11.0 30 ▇▅▁▁▁
availability_60 0 1 18.18 19.37 0 0 3 40.0 60 ▇▁▁▆▁
availability_90 0 1 31.99 33.09 0 0 14 70.0 90 ▇▁▁▅▁
availability_365_y 0 1 115.11 135.34 0 0 66 237.0 365 ▇▁▂▁▃
number_of_reviews_y 0 1 17.48 37.20 0 1 4 18.0 775 ▇▁▁▁▁
number_of_reviews_ltm 0 1 4.88 9.78 0 0 1 6.0 334 ▇▁▁▁▁
calculated_host_listings_count_y 0 1 15.80 81.36 1 1 1 4.0 921 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1 12.97 79.12 0 0 1 2.0 919 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1 2.17 12.89 0 0 1 1.0 223 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1 0.04 0.65 0 0 0 0.0 18 ▇▁▁▁▁
str(airbnb_data_fix)
## Classes 'data.table' and 'data.frame':   80234 obs. of  28 variables:
##  $ price_x                                     : int  88 65 100 300 150 29 100 195 72 80 ...
##  $ experiences_offered                         : Factor w/ 5 levels "business","family",..: 2 1 4 3 1 3 3 3 3 3 ...
##  $ host_is_superhost                           : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 2 2 1 ...
##  $ host_listings_count                         : num  3 4 1 18 3 3 2 1 4 2 ...
##  $ host_has_profile_pic                        : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ host_identity_verified                      : Factor w/ 2 levels "f","t": 2 2 2 2 1 1 2 2 1 1 ...
##  $ is_location_exact                           : Factor w/ 2 levels "f","t": 2 2 2 1 2 2 2 1 1 1 ...
##  $ room_type_y                                 : Factor w/ 4 levels "Entire home/apt",..: 1 3 1 1 3 3 3 1 3 1 ...
##  $ accommodates                                : int  4 2 2 6 2 2 2 5 2 2 ...
##  $ bathrooms                                   : num  1 1 1 2 1 1.5 1 1.5 0 1 ...
##  $ bedrooms                                    : num  1 1 1 3 1 1 1 3 1 1 ...
##  $ beds                                        : num  3 0 1 3 1 0 1 3 0 1 ...
##  $ guests_included                             : int  2 1 2 3 1 1 1 1 1 1 ...
##  $ extra_people                                : num  20 15 0 10 0 8 0 0 0 0 ...
##  $ minimum_nights_y                            : int  2 1 10 3 3 10 1 3 2 6 ...
##  $ availability_30                             : int  12 12 0 9 11 13 13 10 11 11 ...
##  $ availability_60                             : int  42 42 0 39 41 43 43 40 38 41 ...
##  $ availability_90                             : int  72 72 13 69 71 73 73 62 54 71 ...
##  $ availability_365_y                          : int  347 347 288 326 346 348 348 334 311 71 ...
##  $ number_of_reviews_y                         : int  192 21 89 42 0 129 6 77 528 52 ...
##  $ number_of_reviews_ltm                       : int  9 5 4 2 0 8 1 10 27 0 ...
##  $ instant_bookable                            : Factor w/ 2 levels "f","t": 2 1 2 2 1 2 1 1 2 1 ...
##  $ require_guest_profile_picture               : Factor w/ 2 levels "f","t": 1 1 2 1 1 1 1 1 1 1 ...
##  $ require_guest_phone_verification            : Factor w/ 2 levels "f","t": 2 1 2 1 1 1 1 1 1 1 ...
##  $ calculated_host_listings_count_y            : int  2 3 1 15 2 3 2 1 4 1 ...
##  $ calculated_host_listings_count_entire_homes : int  2 1 1 15 0 0 1 1 0 1 ...
##  $ calculated_host_listings_count_private_rooms: int  0 2 0 0 2 3 1 0 4 0 ...
##  $ calculated_host_listings_count_shared_rooms : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

In this step I draw a random sample of 4,000 observations to keep training times manageable, and then split the data into training and test sets, leaving 70% in the training set and 30% in the test set.

airbnb_data_omit2 <- na.omit(airbnb_data_fix)

airbnb_data_omit2 = airbnb_data_omit2[sample(nrow(airbnb_data_omit2), 4000), ]


set.seed(123)

split_airbnb <- createDataPartition(airbnb_data_omit2$`price_x`, p = .7, 
                                     list = FALSE, 
                                     times = 1)

airbnb_train <- airbnb_data_omit2[split_airbnb,]
airbnb_test <- airbnb_data_omit2[-split_airbnb,]

3.2 Penalized Regression Approaches

airbnb_train_matrix <-  data.matrix(airbnb_train[,-c("price_x")])
airbnb_test_matrix <-  data.matrix(airbnb_test[,-c("price_x")])
airbnb_target_train <- data.matrix(airbnb_train$price_x)

airbnb_PRA_cv = cv.glmnet(airbnb_train_matrix, airbnb_target_train , family="gaussian")

plot(airbnb_PRA_cv)

airbnb_lambda_min = airbnb_PRA_cv$lambda.min
airbnb_lambda_min
## [1] 0.4285138
airbnb_lambda_1se = airbnb_PRA_cv$lambda.1se
airbnb_lambda_1se
## [1] 7.66461

I saw that there is no great difference in complexity between the minimum-lambda and 1se-lambda models, so I decided to use the 1se lambda to build the model.
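The complexity claim can be checked directly: a cv.glmnet object stores the number of non-zero coefficients for every lambda on its path in the nzero component. A short sketch, assuming the fitted airbnb_PRA_cv object from above:

```r
# number of non-zero coefficients at the two candidate lambdas;
# similar counts mean similar model complexity
n_min <- airbnb_PRA_cv$nzero[which(airbnb_PRA_cv$lambda == airbnb_PRA_cv$lambda.min)]
n_1se <- airbnb_PRA_cv$nzero[which(airbnb_PRA_cv$lambda == airbnb_PRA_cv$lambda.1se)]
c(lambda_min = n_min, lambda_1se = n_1se)
```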

airbnb_PRA <- glmnet(airbnb_train_matrix, as.matrix(airbnb_train$price_x), family="gaussian", alpha=1, lambda= airbnb_PRA_cv$lambda.1se)

summary(airbnb_PRA)
##           Length Class     Mode   
## a0         1     -none-    numeric
## beta      27     dgCMatrix S4     
## df         1     -none-    numeric
## dim        2     -none-    numeric
## lambda     1     -none-    numeric
## dev.ratio  1     -none-    numeric
## nulldev    1     -none-    numeric
## npasses    1     -none-    numeric
## jerr       1     -none-    numeric
## offset     1     -none-    logical
## call       6     -none-    call   
## nobs       1     -none-    numeric
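summary() on a glmnet object only lists its internal slots; to see which predictors the lasso actually kept, coef() is more informative. A sketch, assuming the fitted airbnb_PRA model from above:

```r
# sparse coefficient vector at the chosen lambda;
# rows shown as "." were shrunk to exactly zero by the L1 penalty
coef(airbnb_PRA)
```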

3.3 Decision Tree

airbnb_tune_grid = expand.grid(min_leaf_obs=seq(2,5,1),complexity=seq(0,0.03,0.01)) 


folds <- createFolds(1:nrow(airbnb_train), k = 10)
airbnb_cv_data = airbnb_train

airbnb_cv_param = tibble()
airbnb_all_cv_stat = data.table()

set.seed(123)
for(p in 1:nrow(airbnb_tune_grid)) {
  airbnb_temp_param = airbnb_tune_grid[p,]
  airbnb_temp_result = tibble()
  for(i in 1:10){
    airbnb_temp_test = airbnb_cv_data[folds[[i]],]
    airbnb_temp_train = airbnb_cv_data[-folds[[i]],]
    airbnb_temp_fit=rpart(price_x~.,data = airbnb_temp_train, method="anova", control = rpart.control(minbucket = airbnb_temp_param$min_leaf_obs,cp=airbnb_temp_param$complexity))
    airbnb_temp_test$Prediction = predict(airbnb_temp_fit,airbnb_temp_test)
    airbnb_temp_result=rbind(airbnb_temp_result,airbnb_temp_test)
  }
  # note: this statistic uses only the last fold's predictions and requires
  # exact equality with the continuous price, so values are close to zero
  airbnb_temp_stat = data.table(airbnb_temp_param, Accuracy= sum(airbnb_temp_test$Prediction==airbnb_temp_test$price_x)/nrow(airbnb_temp_test))
  print(airbnb_temp_stat)
  airbnb_all_cv_stat = rbind(airbnb_all_cv_stat, airbnb_temp_stat)
}
##    min_leaf_obs complexity    Accuracy
## 1:            2          0 0.007142857
##    min_leaf_obs complexity    Accuracy
## 1:            3          0 0.007142857
##    min_leaf_obs complexity    Accuracy
## 1:            4          0 0.007142857
##    min_leaf_obs complexity    Accuracy
## 1:            5          0 0.003571429
##    min_leaf_obs complexity Accuracy
## 1:            2       0.01        0
##    min_leaf_obs complexity Accuracy
## 1:            3       0.01        0
##    min_leaf_obs complexity Accuracy
## 1:            4       0.01        0
##    min_leaf_obs complexity Accuracy
## 1:            5       0.01        0
##    min_leaf_obs complexity Accuracy
## 1:            2       0.02        0
##    min_leaf_obs complexity Accuracy
## 1:            3       0.02        0
##    min_leaf_obs complexity Accuracy
## 1:            4       0.02        0
##    min_leaf_obs complexity Accuracy
## 1:            5       0.02        0
##    min_leaf_obs complexity Accuracy
## 1:            2       0.03        0
##    min_leaf_obs complexity Accuracy
## 1:            3       0.03        0
##    min_leaf_obs complexity Accuracy
## 1:            4       0.03        0
##    min_leaf_obs complexity Accuracy
## 1:            5       0.03        0

According to the cross-validation results, the best accuracy was obtained with the following parameters. (Exact-match accuracy is a very strict metric for a regression tree, which is why the values above are close to zero.)

airbnb_best_DT = airbnb_all_cv_stat%>%arrange(-Accuracy)%>%head(1)
airbnb_best_DT
##    min_leaf_obs complexity    Accuracy
## 1:            2          0 0.007142857
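Since exact-match accuracy is nearly always zero for a continuous target, an error measure would give a more informative ranking. A sketch of that change, re-using the objects from the loop above:

```r
# inside the parameter loop, replace the accuracy line with an error
# measure computed over the predictions accumulated across all ten folds:
airbnb_temp_stat = data.table(airbnb_temp_param,
  MAE = mean(abs(airbnb_temp_result$Prediction - airbnb_temp_result$price_x)))

# ... and afterwards select the setting with the smallest error:
# airbnb_all_cv_stat %>% arrange(MAE) %>% head(1)
```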

These best-performing parameters are then used to build the CART tree.

airbnb_DT = rpart(price_x~.,data = airbnb_train, method="anova",control = rpart.control(minbucket = airbnb_best_DT$min_leaf_obs,cp=airbnb_best_DT$complexity))

fancyRpartPlot(airbnb_DT)

A very complex tree was built, so there may be overfitting.
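One standard remedy for an overgrown rpart tree is cost-complexity pruning: pick the cp value with the lowest cross-validated error from the tree's own cptable and prune back to it. A sketch, assuming the fitted airbnb_DT object:

```r
# cp value minimizing the cross-validated relative error (xerror)
best_cp <- airbnb_DT$cptable[which.min(airbnb_DT$cptable[, "xerror"]), "CP"]

# prune back to that complexity and re-plot the smaller tree
airbnb_DT_pruned <- prune(airbnb_DT, cp = best_cp)
fancyRpartPlot(airbnb_DT_pruned)
```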

3.4 Random Forests

For the random forest, only m (mtry, the number of variables sampled at each split) is tuned; the other parameters are fixed at J = 500 trees and a minimum of 5 observations per terminal node.

For mtry, the square root of the number of features is generally taken. Since there are 27 predictors, I cross-validate over 5, 6, and 7.
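The rule of thumb can be computed directly; airbnb_train has 27 predictors once the price_x target is excluded:

```r
# sqrt(p) heuristic for mtry in a regression random forest
p <- ncol(airbnb_train) - 1   # exclude the target price_x, leaving 27
floor(sqrt(p))                # = 5, hence the grid c(5, 6, 7)
```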

set.seed(123)
airbnb_m=c(5,6,7)
J=500 
min_size_of_terminal_nodes=5
n_folds=10

airbnb_RF_fitControl=trainControl(method = "repeatedcv",
                           number = n_folds,
                           repeats = 1, search="random")  

airbnb_RF_grid=expand.grid(mtry = airbnb_m)

airbnb_RF_train <- train(price_x~., data = airbnb_train, method = "rf", trControl = airbnb_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = airbnb_RF_grid)

print(airbnb_RF_train[["results"]])
##   mtry     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1    5 51.21600 0.5960443 33.77423 6.536677 0.06448376 2.846482
## 2    6 51.10840 0.5975845 33.65839 6.441859 0.06435204 2.813055
## 3    7 51.12061 0.5970414 33.69530 6.507759 0.06396738 2.791607

3.5 Stochastic Gradient Boosting

set.seed(123)

airbnb_GBM_control <- trainControl(method='cv',  number=5,  search='grid')


airbnb_GBM_grid <-  expand.grid(interaction.depth = c(2, 3, 5),  n.trees = c(50,100),  shrinkage = c(0.05,0.1), n.minobsinnode = 10)


print(airbnb_GBM_grid)
##    interaction.depth n.trees shrinkage n.minobsinnode
## 1                  2      50      0.05             10
## 2                  3      50      0.05             10
## 3                  5      50      0.05             10
## 4                  2     100      0.05             10
## 5                  3     100      0.05             10
## 6                  5     100      0.05             10
## 7                  2      50      0.10             10
## 8                  3      50      0.10             10
## 9                  5      50      0.10             10
## 10                 2     100      0.10             10
## 11                 3     100      0.10             10
## 12                 5     100      0.10             10
set.seed(123)
airbnb_GBM_train <- train(price_x~., 
                 data = airbnb_train,
                 method = 'gbm',
                 metric = 'RMSE',
                 trControl=airbnb_GBM_control,
                 tuneGrid = airbnb_GBM_grid, 
                 verbose=FALSE)

print(airbnb_GBM_train)
## Stochastic Gradient Boosting 
## 
## 2802 samples
##   27 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2242, 2242, 2239, 2243, 2242 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   0.05       2                   50      55.58109  0.5470749  37.32496
##   0.05       2                  100      53.10155  0.5680287  35.12149
##   0.05       3                   50      54.21046  0.5634243  36.24254
##   0.05       3                  100      52.03683  0.5837548  34.25918
##   0.05       5                   50      53.19897  0.5741560  35.39588
##   0.05       5                  100      51.28301  0.5944135  33.59028
##   0.10       2                   50      52.87808  0.5722498  35.00836
##   0.10       2                  100      51.43884  0.5917479  33.81389
##   0.10       3                   50      52.13886  0.5817968  34.26897
##   0.10       3                  100      51.08181  0.5974978  33.41614
##   0.10       5                   50      51.57281  0.5893508  33.90920
##   0.10       5                  100      51.02036  0.5988958  33.39772
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  5, shrinkage = 0.1 and n.minobsinnode = 10.
summary(airbnb_GBM_train)

##                                                                                       var
## accommodates                                                                 accommodates
## bedrooms                                                                         bedrooms
## bathrooms                                                                       bathrooms
## calculated_host_listings_count_entire_homes   calculated_host_listings_count_entire_homes
## room_type_yPrivate room                                           room_type_yPrivate room
## host_listings_count                                                   host_listings_count
## extra_people                                                                 extra_people
## calculated_host_listings_count_y                         calculated_host_listings_count_y
## availability_365_y                                                     availability_365_y
## minimum_nights_y                                                         minimum_nights_y
## guests_included                                                           guests_included
## number_of_reviews_y                                                   number_of_reviews_y
## availability_30                                                           availability_30
## beds                                                                                 beds
## availability_90                                                           availability_90
## number_of_reviews_ltm                                               number_of_reviews_ltm
## availability_60                                                           availability_60
## instant_bookablet                                                       instant_bookablet
## calculated_host_listings_count_private_rooms calculated_host_listings_count_private_rooms
## room_type_yShared room                                             room_type_yShared room
## is_location_exactt                                                     is_location_exactt
## calculated_host_listings_count_shared_rooms   calculated_host_listings_count_shared_rooms
## host_is_superhostt                                                     host_is_superhostt
## host_identity_verifiedt                                           host_identity_verifiedt
## experiences_offeredfamily                                       experiences_offeredfamily
## experiences_offerednone                                           experiences_offerednone
## experiences_offeredromantic                                   experiences_offeredromantic
## experiences_offeredsocial                                       experiences_offeredsocial
## host_has_profile_pict                                               host_has_profile_pict
## room_type_yHotel room                                               room_type_yHotel room
## require_guest_profile_picturet                             require_guest_profile_picturet
## require_guest_phone_verificationt                       require_guest_phone_verificationt
##                                                 rel.inf
## accommodates                                 34.4415463
## bedrooms                                     11.2423056
## bathrooms                                    10.3089723
## calculated_host_listings_count_entire_homes   9.3330259
## room_type_yPrivate room                       8.3037269
## host_listings_count                           5.0822816
## extra_people                                  3.6024284
## calculated_host_listings_count_y              3.3069958
## availability_365_y                            3.1281807
## minimum_nights_y                              2.0652280
## guests_included                               1.6685606
## number_of_reviews_y                           1.3469511
## availability_30                               0.9935921
## beds                                          0.9179146
## availability_90                               0.9048866
## number_of_reviews_ltm                         0.8121054
## availability_60                               0.7607171
## instant_bookablet                             0.4349347
## calculated_host_listings_count_private_rooms  0.4181749
## room_type_yShared room                        0.2449279
## is_location_exactt                            0.2267264
## calculated_host_listings_count_shared_rooms   0.2027996
## host_is_superhostt                            0.1391778
## host_identity_verifiedt                       0.1138399
## experiences_offeredfamily                     0.0000000
## experiences_offerednone                       0.0000000
## experiences_offeredromantic                   0.0000000
## experiences_offeredsocial                     0.0000000
## host_has_profile_pict                         0.0000000
## room_type_yHotel room                         0.0000000
## require_guest_profile_picturet                0.0000000
## require_guest_phone_verificationt             0.0000000

3.6 Comparison

airbnb_PRA_predictions_train <- predict(airbnb_PRA_cv, airbnb_train_matrix, s='lambda.1se')
airbnb_PRA_predictions_test <- predict(airbnb_PRA_cv, airbnb_test_matrix, s='lambda.1se')

airbnb_PRA_Rsquared_train <- cor(airbnb_PRA_predictions_train, airbnb_train$price_x) ^ 2
airbnb_PRA_Rsquared_test <- cor(airbnb_PRA_predictions_test, airbnb_test$price_x) ^ 2


airbnb_DT_predictions_train <- predict(airbnb_DT, airbnb_train[,-c("price_x")], type="vector")
airbnb_DT_predictions_test <- predict(airbnb_DT, airbnb_test[,-c("price_x")], type="vector")

airbnb_DT_Rsquared_train <- cor(airbnb_DT_predictions_train, airbnb_train$price_x) ^ 2
airbnb_DT_Rsquared_test <- cor(airbnb_DT_predictions_test, airbnb_test$price_x) ^ 2


airbnb_RF_predictions_train  <- predict(airbnb_RF_train, airbnb_train, type="raw")
airbnb_RF_predictions_test  <- predict(airbnb_RF_train, airbnb_test, type="raw")

airbnb_RF_Rsquared_train <- cor(airbnb_RF_predictions_train , airbnb_train$price_x) ^ 2
airbnb_RF_Rsquared_test <- cor(airbnb_RF_predictions_test, airbnb_test$price_x) ^ 2


airbnb_SGB_predictions_train <- predict(airbnb_GBM_train, airbnb_train, type="raw")
airbnb_SGB_predictions_test <- predict(airbnb_GBM_train, airbnb_test, type="raw")

airbnb_SGB_Rsquared_train <- cor(airbnb_SGB_predictions_train , airbnb_train$price_x) ^ 2
airbnb_SGB_Rsquared_test <- cor(airbnb_SGB_predictions_test, airbnb_test$price_x) ^ 2

airbnb_test_performances <- data.table(airbnb_PRA=airbnb_PRA_Rsquared_test, airbnb_DT_Rsquared_test, airbnb_RF_Rsquared_test, airbnb_SGB_Rsquared_test)

airbnb_train_performances <- data.table(airbnb_PRA=airbnb_PRA_Rsquared_train, airbnb_DT_Rsquared_train, airbnb_RF_Rsquared_train, airbnb_SGB_Rsquared_train)

airbnb_test_performances
##    airbnb_PRA.V1 airbnb_DT_Rsquared_test airbnb_RF_Rsquared_test
## 1:     0.4898242               0.3559494               0.5821181
##    airbnb_SGB_Rsquared_test
## 1:                0.5923573
airbnb_train_performances
##    airbnb_PRA.V1 airbnb_DT_Rsquared_train airbnb_RF_Rsquared_train
## 1:     0.5209606                0.8971872                0.9088639
##    airbnb_SGB_Rsquared_train
## 1:                 0.6956277
  • Overall, R-squared values are low for each method.
  • SGB and random forest reach the highest test R-squared values (about 0.59 and 0.58). Random forest, however, shows a much larger gap between training and test performance, so SGB is the better model for this dataset.
  • The decision tree performed worst on the test data, and the large gap between its training and test performance indicates overfitting.
  • I expected the penalized regression approach to work well for this data because the target is continuous, which can be a problem for tree-based approaches. Nonetheless, PRA performed poorly. This is probably due to non-linear relationships: the linearity assumptions are likely not fulfilled, since the histogram of listing prices looks more like a Poisson-type (right-skewed) distribution than a normal one.
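Correlation-based R-squared ignores scale and bias in the predictions, so a complementary check is test RMSE (the same metric the RF and GBM tuning already reported). A sketch, assuming the test-set prediction vectors computed above:

```r
# root mean squared error on the held-out test set for each model
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

data.table(
  PRA = rmse(airbnb_PRA_predictions_test, airbnb_test$price_x),
  DT  = rmse(airbnb_DT_predictions_test,  airbnb_test$price_x),
  RF  = rmse(airbnb_RF_predictions_test,  airbnb_test$price_x),
  SGB = rmse(airbnb_SGB_predictions_test, airbnb_test$price_x)
)
```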

4. FOREST DATASET

4.1 Data Manipulation

While exploring the data I saw that the wilderness area and soil type variables are all binary indicators stored as integers, so I converted them, together with the target Cover_Type, to factors.

colnames(forest_data) <- make.names(colnames(forest_data))

forest_data_factors <- forest_data %>% 
  select(11:54)

forest_factors_names <- colnames(forest_data_factors)
forest_factors_names
##  [1] "Wilderness_Area1" "Wilderness_Area2" "Wilderness_Area3" "Wilderness_Area4"
##  [5] "Soil_Type1"       "Soil_Type2"       "Soil_Type3"       "Soil_Type4"      
##  [9] "Soil_Type5"       "Soil_Type6"       "Soil_Type7"       "Soil_Type8"      
## [13] "Soil_Type9"       "Soil_Type10"      "Soil_Type11"      "Soil_Type12"     
## [17] "Soil_Type13"      "Soil_Type14"      "Soil_Type15"      "Soil_Type16"     
## [21] "Soil_Type17"      "Soil_Type18"      "Soil_Type19"      "Soil_Type20"     
## [25] "Soil_Type21"      "Soil_Type22"      "Soil_Type23"      "Soil_Type24"     
## [29] "Soil_Type25"      "Soil_Type26"      "Soil_Type27"      "Soil_Type28"     
## [33] "Soil_Type29"      "Soil_Type30"      "Soil_Type31"      "Soil_Type32"     
## [37] "Soil_Type33"      "Soil_Type34"      "Soil_Type35"      "Soil_Type36"     
## [41] "Soil_Type37"      "Soil_Type38"      "Soil_Type39"      "Soil_Type40"
forest_data_fix <- forest_data %>% 
  mutate_at(c(forest_factors_names, "Cover_Type"), factor)

str(forest_data_fix)  
## Classes 'data.table' and 'data.frame':   581012 obs. of  55 variables:
##  $ Elevation                         : int  2596 2590 2804 2785 2595 2579 2606 2605 2617 2612 ...
##  $ Aspect                            : int  51 56 139 155 45 132 45 49 45 59 ...
##  $ Slope                             : int  3 2 9 18 2 6 7 4 9 10 ...
##  $ Horizontal_Distance_To_Hydrology  : int  258 212 268 242 153 300 270 234 240 247 ...
##  $ Vertical_Distance_To_Hydrology    : int  0 -6 65 118 -1 -15 5 7 56 11 ...
##  $ Horizontal_Distance_To_Roadways   : int  510 390 3180 3090 391 67 633 573 666 636 ...
##  $ Hillshade_9am                     : int  221 220 234 238 220 230 222 222 223 228 ...
##  $ Hillshade_Noon                    : int  232 235 238 238 234 237 225 230 221 219 ...
##  $ Hillshade_3pm                     : int  148 151 135 122 150 140 138 144 133 124 ...
##  $ Horizontal_Distance_To_Fire_Points: int  6279 6225 6121 6211 6172 6031 6256 6228 6244 6230 ...
##  $ Wilderness_Area1                  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Wilderness_Area2                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Wilderness_Area3                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Wilderness_Area4                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type1                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type2                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type3                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type4                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type5                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type6                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type7                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type8                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type9                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type10                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type11                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type12                       : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Soil_Type13                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type14                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type15                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type16                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type17                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type18                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type19                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type20                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type21                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type22                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type23                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type24                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type25                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type26                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type27                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type28                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type29                       : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
##  $ Soil_Type30                       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ Soil_Type31                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type32                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type33                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type34                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type35                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type36                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type37                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type38                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type39                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil_Type40                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Cover_Type                        : Factor w/ 7 levels "1","2","3","4",..: 5 5 2 2 5 2 5 5 5 5 ...
##  - attr(*, ".internal.selfref")=<externalptr>
skim(forest_data_fix)
Data summary
Name forest_data_fix
Number of rows 581012
Number of columns 55
_______________________
Column type frequency:
factor 45
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Wilderness_Area1 0 1 FALSE 2 0: 320216, 1: 260796
Wilderness_Area2 0 1 FALSE 2 0: 551128, 1: 29884
Wilderness_Area3 0 1 FALSE 2 0: 327648, 1: 253364
Wilderness_Area4 0 1 FALSE 2 0: 544044, 1: 36968
Soil_Type1 0 1 FALSE 2 0: 577981, 1: 3031
Soil_Type2 0 1 FALSE 2 0: 573487, 1: 7525
Soil_Type3 0 1 FALSE 2 0: 576189, 1: 4823
Soil_Type4 0 1 FALSE 2 0: 568616, 1: 12396
Soil_Type5 0 1 FALSE 2 0: 579415, 1: 1597
Soil_Type6 0 1 FALSE 2 0: 574437, 1: 6575
Soil_Type7 0 1 FALSE 2 0: 580907, 1: 105
Soil_Type8 0 1 FALSE 2 0: 580833, 1: 179
Soil_Type9 0 1 FALSE 2 0: 579865, 1: 1147
Soil_Type10 0 1 FALSE 2 0: 548378, 1: 32634
Soil_Type11 0 1 FALSE 2 0: 568602, 1: 12410
Soil_Type12 0 1 FALSE 2 0: 551041, 1: 29971
Soil_Type13 0 1 FALSE 2 0: 563581, 1: 17431
Soil_Type14 0 1 FALSE 2 0: 580413, 1: 599
Soil_Type15 0 1 FALSE 2 0: 581009, 1: 3
Soil_Type16 0 1 FALSE 2 0: 578167, 1: 2845
Soil_Type17 0 1 FALSE 2 0: 577590, 1: 3422
Soil_Type18 0 1 FALSE 2 0: 579113, 1: 1899
Soil_Type19 0 1 FALSE 2 0: 576991, 1: 4021
Soil_Type20 0 1 FALSE 2 0: 571753, 1: 9259
Soil_Type21 0 1 FALSE 2 0: 580174, 1: 838
Soil_Type22 0 1 FALSE 2 0: 547639, 1: 33373
Soil_Type23 0 1 FALSE 2 0: 523260, 1: 57752
Soil_Type24 0 1 FALSE 2 0: 559734, 1: 21278
Soil_Type25 0 1 FALSE 2 0: 580538, 1: 474
Soil_Type26 0 1 FALSE 2 0: 578423, 1: 2589
Soil_Type27 0 1 FALSE 2 0: 579926, 1: 1086
Soil_Type28 0 1 FALSE 2 0: 580066, 1: 946
Soil_Type29 0 1 FALSE 2 0: 465765, 1: 115247
Soil_Type30 0 1 FALSE 2 0: 550842, 1: 30170
Soil_Type31 0 1 FALSE 2 0: 555346, 1: 25666
Soil_Type32 0 1 FALSE 2 0: 528493, 1: 52519
Soil_Type33 0 1 FALSE 2 0: 535858, 1: 45154
Soil_Type34 0 1 FALSE 2 0: 579401, 1: 1611
Soil_Type35 0 1 FALSE 2 0: 579121, 1: 1891
Soil_Type36 0 1 FALSE 2 0: 580893, 1: 119
Soil_Type37 0 1 FALSE 2 0: 580714, 1: 298
Soil_Type38 0 1 FALSE 2 0: 565439, 1: 15573
Soil_Type39 0 1 FALSE 2 0: 567206, 1: 13806
Soil_Type40 0 1 FALSE 2 0: 572262, 1: 8750
Cover_Type 0 1 FALSE 7 2: 283301, 1: 211840, 3: 35754, 7: 20510

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Elevation 0 1 2959.37 279.98 1859 2809 2996 3163 3858 ▁▂▇▇▁
Aspect 0 1 155.66 111.91 0 58 127 260 360 ▇▆▃▃▅
Slope 0 1 14.10 7.49 0 9 13 18 66 ▇▆▁▁▁
Horizontal_Distance_To_Hydrology 0 1 269.43 212.55 0 108 218 384 1397 ▇▃▁▁▁
Vertical_Distance_To_Hydrology 0 1 46.42 58.30 -173 7 30 69 601 ▁▇▁▁▁
Horizontal_Distance_To_Roadways 0 1 2350.15 1559.25 0 1106 1997 3328 7117 ▇▇▅▂▁
Hillshade_9am 0 1 212.15 26.77 0 198 218 231 254 ▁▁▁▃▇
Hillshade_Noon 0 1 223.32 19.77 0 213 226 237 254 ▁▁▁▁▇
Hillshade_3pm 0 1 142.53 38.27 0 119 143 168 254 ▁▂▇▆▁
Horizontal_Distance_To_Fire_Points 0 1 1980.29 1324.20 0 1024 1710 2550 7173 ▇▇▂▁▁
table(forest_data_fix$Cover_Type)
## 
##      1      2      3      4      5      6      7 
## 211840 283301  35754   2747   9493  17367  20510

When I checked the data with skim() and str() I saw that there are no missing values. After converting the dummy-encoded columns from integer to factor, the data is clean.
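The dummy-to-factor conversion mentioned above can be sketched generically in base R; the data frame and column names here are illustrative, not the actual dataset:

```r
# Convert 0/1 integer dummy columns to factors, leaving true numerics alone.
df <- data.frame(Elevation        = c(2596L, 2590L, 2804L),
                 Soil_Type1       = c(0L, 1L, 0L),
                 Wilderness_Area1 = c(1L, 0L, 1L))

# A column is treated as a dummy if it is integer and takes only values 0/1.
is_dummy <- sapply(df, function(x) is.integer(x) && all(x %in% c(0L, 1L)))
df[is_dummy] <- lapply(df[is_dummy], factor)

sapply(df, class)
```

This keeps genuinely numeric columns such as Elevation untouched while the indicator columns become two-level factors.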

There are over 500,000 observations in the dataset and building models was taking too long, so I sampled 4,000 rows from it. After sampling, I separated 70% of the data for training and 30% for testing.

forest_data_omit <- na.omit(forest_data_fix)

# seed before sampling so the 4000-row subsample is reproducible
set.seed(12345)
forest_data_omit <- forest_data_omit[sample(nrow(forest_data_omit), 4000), ]

split_forest <- createDataPartition(forest_data_omit$`Cover_Type`, p = .7, 
                                     list = FALSE, 
                                     times = 1)

forest_train <- forest_data_omit[split_forest,]
forest_test <- forest_data_omit[-split_forest,]

4.2 Penalized Regression

In the penalized regression approach the features need to be standardized. However, the data contains dummy-encoded columns, and rescaling 0/1 indicators to roughly the [-3, 3] range does not make much sense, so I applied standardization only to the numerical variables. There will still be a small difference in scale, but it is far better than the raw data. I standardized both the train and test sets for the penalized regression approach (ideally the test set would be scaled with the training-set means and standard deviations; here each set is scaled independently). The tree-based approaches do not require standardization.

forest_train_standard <- forest_train %>% 
    mutate_if(is.integer, scale)

forest_test_standard <- forest_test %>% 
    mutate_if(is.integer, scale)
forest_train_matrix = matrix()
forest_test_matrix = matrix()
forest_target = matrix()

forest_train_matrix <-  data.matrix(forest_train_standard[,-c("Cover_Type")])
forest_test_matrix <-  data.matrix(forest_test_standard[,-c("Cover_Type")])
forest_target <- data.matrix(forest_train_standard$Cover_Type)

set.seed(123)
forest_PRA_cv = cv.glmnet(forest_train_matrix, forest_target, family="multinomial", nfolds=10)


forest_lambda_min = forest_PRA_cv$lambda.min
forest_lambda_min
## [1] 0.0009328935
forest_lambda_1se = forest_PRA_cv$lambda.1se
forest_lambda_1se
## [1] 0.003766107
plot(forest_PRA_cv)

I saw that there is no great difference in complexity between the lambda.min and lambda.1se models, so I decided to use lambda.min to build the model.

4.3 Decision Trees

Tree parameters are tuned using 10-fold cross-validation.

forest_tune_grid = expand.grid(min_leaf_obs=seq(5,15,1),complexity=seq(0,0.3,0.1)) 


# seed before fold creation so the folds are reproducible
set.seed(123)
folds <- createFolds(1:nrow(forest_train), k = 10)
forest_cv_data = forest_train

forest_cv_param = tibble()
forest_all_cv_stat = data.table()

set.seed(123)
for(p in 1:nrow(forest_tune_grid)) {
  forest_temp_param = forest_tune_grid[p,]
  forest_temp_result = tibble()
  for(i in 1:10){
    forest_temp_test = forest_cv_data[folds[[i]],]
    forest_temp_train = forest_cv_data[-folds[[i]],]
    forest_temp_fit=rpart(Cover_Type~.,data = forest_temp_train, method="class", control = rpart.control(minbucket = forest_temp_param$min_leaf_obs,cp=forest_temp_param$complexity))
    forest_temp_test$Prediction = predict(forest_temp_fit,forest_temp_test,type="class")
    forest_temp_result=rbind(forest_temp_result,forest_temp_test)
  }
  # accuracy over all folds' held-out predictions, not just the last fold
  forest_temp_stat = data.table(forest_temp_param, Accuracy = sum(forest_temp_result$Prediction == forest_temp_result$Cover_Type) / nrow(forest_temp_result))
  print(forest_temp_stat)
  forest_all_cv_stat = rbind(forest_all_cv_stat, forest_temp_stat)
}
##    min_leaf_obs complexity  Accuracy
## 1:            5          0 0.7071429
##    min_leaf_obs complexity  Accuracy
## 1:            6          0 0.7142857
##    min_leaf_obs complexity  Accuracy
## 1:            7          0 0.7071429
##    min_leaf_obs complexity  Accuracy
## 1:            8          0 0.7071429
##    min_leaf_obs complexity  Accuracy
## 1:            9          0 0.6928571
##    min_leaf_obs complexity  Accuracy
## 1:           10          0 0.6678571
##    min_leaf_obs complexity  Accuracy
## 1:           11          0 0.6785714
##    min_leaf_obs complexity  Accuracy
## 1:           12          0 0.6892857
##    min_leaf_obs complexity  Accuracy
## 1:           13          0 0.6928571
##    min_leaf_obs complexity  Accuracy
## 1:           14          0 0.7178571
##    min_leaf_obs complexity  Accuracy
## 1:           15          0 0.7071429
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           11        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           12        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           13        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           14        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           15        0.1 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           11        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           12        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           13        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           14        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:           15        0.2 0.6607143
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:           11        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:           12        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:           13        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:           14        0.3 0.4964286
##    min_leaf_obs complexity  Accuracy
## 1:           15        0.3 0.4964286

According to the cross-validation results, the best accuracy was obtained with the following parameters.

forest_best_DT = forest_all_cv_stat%>%arrange(-Accuracy)%>%head(1)
forest_best_DT
##    min_leaf_obs complexity  Accuracy
## 1:           14          0 0.7178571

I then use these best-performing parameters to build the CART tree.

forest_DT = rpart(Cover_Type~.,data = forest_train, method="class",control = rpart.control(minbucket = forest_best_DT$min_leaf_obs,cp=forest_best_DT$complexity))

fancyRpartPlot(forest_DT)

As we can see, the generated tree is not too complex.
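One way to quantify a tree's complexity is to count its terminal nodes. A minimal sketch on R's built-in iris data (not the forest data) using the same rpart package:

```r
library(rpart)

# Fit a small classification tree with default settings and count its leaves.
fit <- rpart(Species ~ ., data = iris, method = "class")
n_leaves <- sum(fit$frame$var == "<leaf>")
n_leaves
```

The same `sum(fit$frame$var == "<leaf>")` idiom can be applied to forest_DT to put a number on "not too complex".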

4.4 Random Forests

For RF we tune only m (mtry), keeping the other parameters fixed at J = 500 trees and a minimum of 5 observations per terminal node.

For mtry, the square root of the number of features is the usual default. With 54 predictors, sqrt(54) ≈ 7.3, so I cross-validate over mtry values of 7, 8 and 9.
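The candidate grid follows directly from the square-root heuristic:

```r
p <- 54                      # number of predictor columns (excluding Cover_Type)
m_default <- floor(sqrt(p))  # common classification default: floor(sqrt(p)) = 7
forest_m <- c(m_default, m_default + 1, m_default + 2)
forest_m
```

Searching a small neighborhood around the default, rather than the default alone, lets cross-validation confirm or reject the heuristic cheaply.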

set.seed(123)
forest_m=c(7,8,9)
J=500 
min_size_of_terminal_nodes=5
n_folds=10

forest_RF_fitControl=trainControl(method = "repeatedcv",
                           number = n_folds,
                           repeats = 1, classProbs = TRUE, search="random")  

forest_RF_grid=expand.grid(mtry = forest_m)

forest_RF_train <- train(make.names(as.factor(Cover_Type)) ~ ., data = forest_train, method = "rf", trControl = forest_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = forest_RF_grid)

print(forest_RF_train[["results"]])
##   mtry  Accuracy     Kappa AccuracySD    KappaSD
## 1    7 0.7393034 0.5715715 0.02120008 0.03526393
## 2    8 0.7453787 0.5820367 0.02250172 0.03686341
## 3    9 0.7500178 0.5903403 0.02248470 0.03678760

4.5 Stochastic Gradient Boosting

set.seed(123)
forest_GBM_Control <- trainControl(method='cv', 
                        number=5, 
                        search='grid')

forest_GBM_Grid <-  expand.grid(interaction.depth = c(2, 3, 5),
                         n.trees = c(50,100), 
                         shrinkage = c(0.05, 0.1),
                         n.minobsinnode = 10)

print(forest_GBM_Grid)
##    interaction.depth n.trees shrinkage n.minobsinnode
## 1                  2      50      0.05             10
## 2                  3      50      0.05             10
## 3                  5      50      0.05             10
## 4                  2     100      0.05             10
## 5                  3     100      0.05             10
## 6                  5     100      0.05             10
## 7                  2      50      0.10             10
## 8                  3      50      0.10             10
## 9                  5      50      0.10             10
## 10                 2     100      0.10             10
## 11                 3     100      0.10             10
## 12                 5     100      0.10             10
forest_GBM_train <- train(Cover_Type ~ ., 
                 data =forest_train,
                 method = 'gbm',
                 metric = 'Accuracy',
                 trControl=forest_GBM_Control,
                 tuneGrid = forest_GBM_Grid,
                 verbose=FALSE)

print(forest_GBM_train)
## Stochastic Gradient Boosting 
## 
## 2804 samples
##   54 predictor
##    7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2243, 2245, 2244, 2243, 2241 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  Accuracy   Kappa    
##   0.05       2                   50      0.7072362  0.5156248
##   0.05       2                  100      0.7243537  0.5483239
##   0.05       3                   50      0.7115016  0.5252034
##   0.05       3                  100      0.7189895  0.5406846
##   0.05       5                   50      0.7232645  0.5452101
##   0.05       5                  100      0.7328991  0.5649700
##   0.10       2                   50      0.7222058  0.5453391
##   0.10       2                  100      0.7243511  0.5504780
##   0.10       3                   50      0.7232714  0.5493122
##   0.10       3                  100      0.7257631  0.5543864
##   0.10       5                   50      0.7257543  0.5524149
##   0.10       5                  100      0.7303844  0.5619303
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  5, shrinkage = 0.05 and n.minobsinnode = 10.
summary(forest_GBM_train)

##                                                                   var
## Elevation                                                   Elevation
## Horizontal_Distance_To_Roadways       Horizontal_Distance_To_Roadways
## Horizontal_Distance_To_Hydrology     Horizontal_Distance_To_Hydrology
## Horizontal_Distance_To_Fire_Points Horizontal_Distance_To_Fire_Points
## Hillshade_Noon                                         Hillshade_Noon
## Hillshade_9am                                           Hillshade_9am
## Vertical_Distance_To_Hydrology         Vertical_Distance_To_Hydrology
## Hillshade_3pm                                           Hillshade_3pm
## Aspect                                                         Aspect
## Slope                                                           Slope
## Soil_Type41                                               Soil_Type41
## Soil_Type221                                             Soil_Type221
## Soil_Type21                                               Soil_Type21
## Wilderness_Area11                                   Wilderness_Area11
## Wilderness_Area31                                   Wilderness_Area31
## Soil_Type231                                             Soil_Type231
## Soil_Type381                                             Soil_Type381
## Soil_Type101                                             Soil_Type101
## Soil_Type391                                             Soil_Type391
## Soil_Type171                                             Soil_Type171
## Soil_Type201                                             Soil_Type201
## Soil_Type121                                             Soil_Type121
## Soil_Type331                                             Soil_Type331
## Soil_Type321                                             Soil_Type321
## Soil_Type131                                             Soil_Type131
## Soil_Type301                                             Soil_Type301
## Soil_Type291                                             Soil_Type291
## Soil_Type61                                               Soil_Type61
## Soil_Type31                                               Soil_Type31
## Soil_Type401                                             Soil_Type401
## Soil_Type111                                             Soil_Type111
## Wilderness_Area41                                   Wilderness_Area41
## Wilderness_Area21                                   Wilderness_Area21
## Soil_Type311                                             Soil_Type311
## Soil_Type241                                             Soil_Type241
## Soil_Type11                                               Soil_Type11
## Soil_Type51                                               Soil_Type51
## Soil_Type71                                               Soil_Type71
## Soil_Type81                                               Soil_Type81
## Soil_Type91                                               Soil_Type91
## Soil_Type141                                             Soil_Type141
## Soil_Type151                                             Soil_Type151
## Soil_Type161                                             Soil_Type161
## Soil_Type181                                             Soil_Type181
## Soil_Type191                                             Soil_Type191
## Soil_Type211                                             Soil_Type211
## Soil_Type251                                             Soil_Type251
## Soil_Type261                                             Soil_Type261
## Soil_Type271                                             Soil_Type271
## Soil_Type281                                             Soil_Type281
## Soil_Type341                                             Soil_Type341
## Soil_Type351                                             Soil_Type351
## Soil_Type361                                             Soil_Type361
## Soil_Type371                                             Soil_Type371
##                                        rel.inf
## Elevation                          49.29781390
## Horizontal_Distance_To_Roadways     7.68576519
## Horizontal_Distance_To_Hydrology    5.37036421
## Horizontal_Distance_To_Fire_Points  5.33723021
## Hillshade_Noon                      4.30047688
## Hillshade_9am                       3.78463079
## Vertical_Distance_To_Hydrology      3.71051014
## Hillshade_3pm                       2.60356599
## Aspect                              2.45809099
## Slope                               2.13396677
## Soil_Type41                         1.96946027
## Soil_Type221                        1.49068372
## Soil_Type21                         1.30102109
## Wilderness_Area11                   1.29960806
## Wilderness_Area31                   1.16227898
## Soil_Type231                        0.76412633
## Soil_Type381                        0.71402480
## Soil_Type101                        0.63877705
## Soil_Type391                        0.48653901
## Soil_Type171                        0.37445306
## Soil_Type201                        0.36759796
## Soil_Type121                        0.33314054
## Soil_Type331                        0.33289954
## Soil_Type321                        0.32347011
## Soil_Type131                        0.31590870
## Soil_Type301                        0.27391148
## Soil_Type291                        0.24281779
## Soil_Type61                         0.17044218
## Soil_Type31                         0.16585563
## Soil_Type401                        0.14155309
## Soil_Type111                        0.13348599
## Wilderness_Area41                   0.13202062
## Wilderness_Area21                   0.10468099
## Soil_Type311                        0.05241909
## Soil_Type241                        0.02640886
## Soil_Type11                         0.00000000
## Soil_Type51                         0.00000000
## Soil_Type71                         0.00000000
## Soil_Type81                         0.00000000
## Soil_Type91                         0.00000000
## Soil_Type141                        0.00000000
## Soil_Type151                        0.00000000
## Soil_Type161                        0.00000000
## Soil_Type181                        0.00000000
## Soil_Type191                        0.00000000
## Soil_Type211                        0.00000000
## Soil_Type251                        0.00000000
## Soil_Type261                        0.00000000
## Soil_Type271                        0.00000000
## Soil_Type281                        0.00000000
## Soil_Type341                        0.00000000
## Soil_Type351                        0.00000000
## Soil_Type361                        0.00000000
## Soil_Type371                        0.00000000

4.6 Comparison

prediction_forest_PRA_test <- predict(forest_PRA_cv, newx=forest_test_matrix, s=c("lambda.min"), type="class")
prediction_forest_PRA_train <-  predict(forest_PRA_cv, newx=forest_train_matrix, s=c("lambda.min"), type="class")

cm_forest_PRA_test <- confusionMatrix(factor(prediction_forest_PRA_test), factor(forest_test$Cover_Type))
cm_forest_PRA_train <- confusionMatrix(factor(prediction_forest_PRA_train), factor(forest_train$Cover_Type))


prediction_forest_DT_test <- predict(forest_DT, forest_test, type="class")
prediction_forest_DT_train <- predict(forest_DT, forest_train, type="class")

cm_forest_DT_test <- confusionMatrix(factor(prediction_forest_DT_test), factor(forest_test$Cover_Type))
cm_forest_DT_train <- confusionMatrix(factor(prediction_forest_DT_train), factor(forest_train$Cover_Type))


prediction_forest_RF_test <- predict(forest_RF_train, forest_test, type="raw")
prediction_forest_RF_train <- predict(forest_RF_train, forest_train, type="raw")

cm_forest_RF_test <- confusionMatrix(factor(prediction_forest_RF_test), factor(make.names(forest_test$Cover_Type)))
cm_forest_RF_train <- confusionMatrix(factor(prediction_forest_RF_train), factor(make.names(forest_train$Cover_Type)))


prediction_forest_SGB_test <- predict(forest_GBM_train, forest_test, type="raw")
prediction_forest_SGB_train <- predict(forest_GBM_train, forest_train, type="raw")

cm_forest_SGB_test <- confusionMatrix(factor(make.names(prediction_forest_SGB_test)), factor(make.names(forest_test$Cover_Type)))
cm_forest_SGB_train <- confusionMatrix(factor(make.names(prediction_forest_SGB_train)), factor(make.names(forest_train$Cover_Type)))


forest_test_performances <- data.table(PRA_test=cm_forest_PRA_test$overall["Accuracy"], DT_test=cm_forest_DT_test$overall["Accuracy"], RF_test=cm_forest_RF_test$overall["Accuracy"], SGB_test=cm_forest_SGB_test$overall["Accuracy"])

forest_train_performances <- data.table(PRA_train=cm_forest_PRA_train$overall["Accuracy"], DT_train=cm_forest_DT_train$overall["Accuracy"], RF_train=cm_forest_RF_train$overall["Accuracy"], SGB_train=cm_forest_SGB_train$overall["Accuracy"])

forest_test_performances
##     PRA_test   DT_test   RF_test  SGB_test
## 1: 0.7006689 0.7073579 0.7399666 0.7249164
forest_train_performances
##    PRA_train DT_train  RF_train SGB_train
## 1:  0.735378 0.767475 0.9432953 0.8252496
  • Overall, all methods performed well on this dataset even though there are over 50 features.
  • The best-performing method was Random Forest.
  • The tree-based ensembles performed very well on the training data, but the large gap between train and test accuracy suggests some overfitting.
  • Penalized regression performed worst; there may be nonlinearity in the data that a linear model cannot capture.
  • However, its test and train accuracies are very close, which indicates a good bias-variance balance.
  • Elevation is found important in all models and is used for many splits in the tree-based approaches; as a continuous feature with high cardinality, it naturally offers many candidate split points.
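The accuracy figures compared above come from caret::confusionMatrix, but the headline number is just the fraction of matching labels. A minimal base-R check with made-up prediction vectors:

```r
# Toy predicted and true labels; accuracy is the mean of exact matches.
pred  <- factor(c("1", "2", "2", "5", "7", "2"))
truth <- factor(c("1", "2", "3", "5", "7", "1"))

# Compare as characters so differing factor levels do not cause problems.
accuracy <- mean(as.character(pred) == as.character(truth))
accuracy  # 4 of 6 correct
```

This is handy as a sanity check against the `overall["Accuracy"]` values reported by confusionMatrix.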

5. HR DATASET

5.1 Data Manipulation

The HR dataset poses a binary classification task (Attrition) with roughly a 5:1 class imbalance:

ggplot(HR_data, aes(x=Attrition )) + geom_bar()
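The 5:1 figure follows directly from the class counts (1233 No vs. 237 Yes, per the data summary below):

```r
# Class counts taken from the dataset summary; the ratio confirms ~5:1 imbalance.
counts <- c(No = 1233, Yes = 237)
imbalance_ratio <- unname(counts["No"] / counts["Yes"])
round(imbalance_ratio, 1)  # ~5.2
```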

When I checked the data with lapply(HR_data, unique) I saw that the EmployeeCount, Over18 and StandardHours variables each have only one unique value. Since there is no variation in these features, I removed them.
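As a more generic alternative to eyeballing the lapply output, constant columns can be detected programmatically; the data frame here is illustrative:

```r
# Detect columns with a single unique value (zero variance) and drop them.
df <- data.frame(EmployeeCount = rep(1L, 4),
                 Over18        = rep("Y", 4),
                 Age           = c(41L, 49L, 37L, 33L))

constant_cols <- names(df)[sapply(df, function(x) length(unique(x)) == 1)]
df_clean <- df[, setdiff(names(df), constant_cols), drop = FALSE]
constant_cols
```

The same check scales to any number of columns, so new zero-variance features are caught automatically if the data is refreshed.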

colnames(HR_data) <- make.names(colnames(HR_data))

HR_data<- HR_data %>% 
    mutate_if(is.character, factor)

HR_data_drop <- select(HR_data, -c("EmployeeCount", "Over18", "StandardHours"))

While exploring the data I saw that some categorical variables are stored as integers even though they have only a handful of levels. So, I filtered those variables and transformed them to factors.

HR_data_factors <- HR_data_drop %>% 
  select_if(~n_distinct(.) <= 9)

colnames(HR_data_factors)
##  [1] "Attrition"                "BusinessTravel"          
##  [3] "Department"               "Education"               
##  [5] "EducationField"           "EnvironmentSatisfaction" 
##  [7] "Gender"                   "JobInvolvement"          
##  [9] "JobLevel"                 "JobRole"                 
## [11] "JobSatisfaction"          "MaritalStatus"           
## [13] "OverTime"                 "PerformanceRating"       
## [15] "RelationshipSatisfaction" "StockOptionLevel"        
## [17] "TrainingTimesLastYear"    "WorkLifeBalance"
HR_data_factors_fix  <- HR_data_factors  %>% 
  select_if(is.integer)

HR_data_fix <- HR_data_drop %>% 
  mutate_at(c( "Education", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel", "TrainingTimesLastYear", "WorkLifeBalance"), factor)

str(HR_data_fix)
## Classes 'data.table' and 'data.frame':   1470 obs. of  32 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Factor w/ 4 levels "1","2","3","4": 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "1","2","3","4": 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : Factor w/ 7 levels "0","1","2","3",..: 1 4 4 4 4 3 4 3 3 4 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "1","2","3","4": 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
##  - attr(*, ".internal.selfref")=<externalptr>
skim(HR_data_fix)
Data summary
Name HR_data_fix
Number of rows 1470
Number of columns 32
_______________________
Column type frequency:
factor 18
numeric 14
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 1233, Yes: 237
BusinessTravel 0 1 FALSE 3 Tra: 1043, Tra: 277, Non: 150
Department 0 1 FALSE 3 Res: 961, Sal: 446, Hum: 63
Education 0 1 FALSE 5 3: 572, 4: 398, 2: 282, 1: 170
EducationField 0 1 FALSE 6 Lif: 606, Med: 464, Mar: 159, Tec: 132
EnvironmentSatisfaction 0 1 FALSE 4 3: 453, 4: 446, 2: 287, 1: 284
Gender 0 1 FALSE 2 Mal: 882, Fem: 588
JobInvolvement 0 1 FALSE 4 3: 868, 2: 375, 4: 144, 1: 83
JobLevel 0 1 FALSE 5 1: 543, 2: 534, 3: 218, 4: 106
JobRole 0 1 FALSE 9 Sal: 326, Res: 292, Lab: 259, Man: 145
JobSatisfaction 0 1 FALSE 4 4: 459, 3: 442, 1: 289, 2: 280
MaritalStatus 0 1 FALSE 3 Mar: 673, Sin: 470, Div: 327
OverTime 0 1 FALSE 2 No: 1054, Yes: 416
PerformanceRating 0 1 FALSE 2 3: 1244, 4: 226
RelationshipSatisfaction 0 1 FALSE 4 3: 459, 4: 432, 2: 303, 1: 276
StockOptionLevel 0 1 FALSE 4 0: 631, 1: 596, 2: 158, 3: 85
TrainingTimesLastYear 0 1 FALSE 7 2: 547, 3: 491, 4: 123, 5: 119
WorkLifeBalance 0 1 FALSE 4 3: 893, 2: 344, 4: 153, 1: 80

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.92 9.14 18 30.00 36.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 802.49 403.51 102 465.00 802.0 1157.00 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.19 8.11 1 2.00 7.0 14.00 29 ▇▅▂▂▂
EmployeeNumber 0 1 1024.87 602.02 1 491.25 1020.5 1555.75 2068 ▇▇▇▇▇
HourlyRate 0 1 65.89 20.33 30 48.00 66.0 83.75 100 ▇▇▇▇▇
MonthlyIncome 0 1 6502.93 4707.96 1009 2911.00 4919.0 8379.00 19999 ▇▅▂▁▂
MonthlyRate 0 1 14313.10 7117.79 2094 8047.00 14235.5 20461.50 26999 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.69 2.50 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.21 3.66 11 12.00 14.0 18.00 25 ▇▅▃▂▁
TotalWorkingYears 0 1 11.28 7.78 0 6.00 10.0 15.00 40 ▇▇▂▁▁
YearsAtCompany 0 1 7.01 6.13 0 3.00 5.0 9.00 40 ▇▂▁▁▁
YearsInCurrentRole 0 1 4.23 3.62 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.19 3.22 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.12 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁
HR_data_omit <- na.omit(HR_data_fix)


set.seed(123)
split_HR <- createDataPartition(HR_data_omit$`Attrition`, p = .7, 
                                     list = FALSE, 
                                     times = 1)

HR_train <- HR_data_omit[split_HR,]
HR_test <- HR_data_omit[-split_HR,]

5.2 Penalized Regression Approaches

In the penalized regression approach the features need to be standardized, but the data contains binary variables, so I applied standardization only to the numerical columns. I standardized both the train and test sets for the penalized regression approach (ideally the test set would be scaled with the training-set statistics; here each set is scaled independently). The tree-based approaches do not require standardization.

HR_train_standard <- HR_train %>% 
    mutate_if(is.numeric, scale)

HR_test_standard <- HR_test %>% 
    mutate_if(is.numeric, scale)

HR_train_matrix <-  data.matrix(HR_train_standard[,-c("Attrition")])
HR_test_matrix <-  data.matrix(HR_test_standard[,-c("Attrition")])

HR_target <- data.matrix (HR_train_standard$Attrition)
HR_target_test <- data.matrix (HR_test_standard$Attrition)
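A stricter alternative to scaling each set on its own statistics is to estimate the centering and scaling parameters on the training data only and apply them to both sets. A minimal sketch using caret's preProcess (already loaded above); preProcess skips factor columns automatically:

```r
# Fit centering/scaling on the training data only, then apply the same
# transform to both sets; factor columns are left untouched by preProcess.
HR_pre <- preProcess(HR_train, method = c("center", "scale"))
HR_train_standard2 <- predict(HR_pre, HR_train)
HR_test_standard2  <- predict(HR_pre, HR_test)
```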

set.seed(123)
HR_PRA_cv = cv.glmnet(HR_train_matrix, HR_target, family="binomial", nfolds=5)


plot(HR_PRA_cv)

HR_lambda_min = HR_PRA_cv$lambda.min
HR_lambda_min
## [1] 0.003658811
HR_lambda_1se = HR_PRA_cv$lambda.1se
HR_lambda_1se
## [1] 0.01477068
HR_PRA <- glmnet(HR_train_matrix, HR_target,family="binomial", alpha=1,lambda=HR_PRA_cv$lambda.min)
summary(HR_PRA)
##            Length Class     Mode     
## a0          1     -none-    numeric  
## beta       31     dgCMatrix S4       
## df          1     -none-    numeric  
## dim         2     -none-    numeric  
## lambda      1     -none-    numeric  
## dev.ratio   1     -none-    numeric  
## nulldev     1     -none-    numeric  
## npasses     1     -none-    numeric  
## jerr        1     -none-    numeric  
## offset      1     -none-    logical  
## classnames  2     -none-    character
## call        6     -none-    call     
## nobs        1     -none-    numeric

5.3 Decision Trees

The decision tree is tuned with 10-fold cross-validation over a grid of minimum leaf sizes and complexity parameter values.

HR_tune_grid = expand.grid(min_leaf_obs=seq(3,10,1),complexity=seq(0,0.3,0.1)) 


HR_folds <- createFolds(1:nrow(HR_train), k = 10)
HR_cv_data = HR_train

HR_cv_param = tibble()
HR_all_cv_stat = data.table()

for(p in 1:nrow(HR_tune_grid)) {
  HR_temp_param = HR_tune_grid[p,]
  HR_temp_result = tibble()
  for(i in 1:10){
    HR_temp_test = HR_cv_data[HR_folds[[i]],]
    HR_temp_train = HR_cv_data[-HR_folds[[i]],]
    HR_temp_fit = rpart(Attrition~., data = HR_temp_train, method = "class",
                        control = rpart.control(minbucket = HR_temp_param$min_leaf_obs,
                                                cp = HR_temp_param$complexity))
    HR_temp_test$Prediction = predict(HR_temp_fit, HR_temp_test, type = "class")
    HR_temp_result = rbind(HR_temp_result, HR_temp_test)
  }
  # accuracy is computed over the pooled held-out predictions from all folds
  HR_temp_stat = data.table(HR_temp_param,
                            Accuracy = sum(HR_temp_result$Prediction == HR_temp_result$Attrition) / nrow(HR_temp_result))
  print(HR_temp_stat)
  HR_all_cv_stat = rbind(HR_all_cv_stat, HR_temp_stat)
}
##    min_leaf_obs complexity  Accuracy
## 1:            3          0 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            4          0 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            5          0 0.8365385
##    min_leaf_obs complexity  Accuracy
## 1:            6          0 0.8173077
##    min_leaf_obs complexity  Accuracy
## 1:            7          0 0.7788462
##    min_leaf_obs complexity  Accuracy
## 1:            8          0 0.7884615
##    min_leaf_obs complexity  Accuracy
## 1:            9          0 0.7980769
##    min_leaf_obs complexity  Accuracy
## 1:           10          0 0.7980769
##    min_leaf_obs complexity  Accuracy
## 1:            3        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            4        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.1 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            3        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            4        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.2 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            3        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            4        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.3 0.8269231
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.3 0.8269231

According to the cross-validation results, the best accuracy was obtained with the following parameters.

HR_best_DT = HR_all_cv_stat%>%arrange(-Accuracy)%>%head(1)
HR_best_DT
##    min_leaf_obs complexity  Accuracy
## 1:            5          0 0.8365385

I then build the CART tree using these best-performing parameters.

HR_DT = rpart(Attrition~. , data = HR_train, method="class", control = rpart.control(minbucket = HR_best_DT$min_leaf_obs,cp=HR_best_DT$complexity))


fancyRpartPlot(HR_DT)

As the plot shows, the resulting tree is not overly complex.

5.4 Random Forests

For the random forest only m (mtry) is tuned; the other parameters are fixed at J = 500 trees and a minimum of 5 observations per terminal node.

A common default for mtry is the square root of the number of features. With 31 predictors, sqrt(31) ≈ 5.6, so I cross-validate over mtry values of 5, 6 and 7.

set.seed(123)
HR_m=c(5,6,7)
J=500 
min_size_of_terminal_nodes=5
n_folds=10

HR_RF_fitControl=trainControl(method = "repeatedcv",
                           number = n_folds,
                           repeats = 1, classProbs = TRUE, search="random")  

HR_RF_grid=expand.grid(mtry = HR_m)

HR_RF_train <- train(Attrition~., data = HR_train, method = "rf", trControl = HR_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = HR_RF_grid)

print(HR_RF_train[["results"]])
##   mtry  Accuracy     Kappa AccuracySD    KappaSD
## 1    5 0.8514506 0.1446482 0.01380373 0.11599747
## 2    6 0.8524310 0.1619882 0.01349232 0.09460159
## 3    7 0.8524215 0.1590154 0.01763513 0.13454714
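The winning mtry can also be read off programmatically from the caret object (a small sketch):

```r
# caret stores the selected hyperparameter combination in bestTune
HR_RF_train$bestTune
```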

5.5 Stochastic Gradient Boosting

set.seed(123)

HR_GBM_control <- trainControl(method='cv',  number=5,  search='grid', summaryFunction = twoClassSummary, classProbs = T)


HR_GBM_grid <-  expand.grid(interaction.depth = c(2, 3, 5),  n.trees = c(50,100),  shrinkage = c(0.05,0.1), n.minobsinnode = 10)



print(HR_GBM_grid)
##    interaction.depth n.trees shrinkage n.minobsinnode
## 1                  2      50      0.05             10
## 2                  3      50      0.05             10
## 3                  5      50      0.05             10
## 4                  2     100      0.05             10
## 5                  3     100      0.05             10
## 6                  5     100      0.05             10
## 7                  2      50      0.10             10
## 8                  3      50      0.10             10
## 9                  5      50      0.10             10
## 10                 2     100      0.10             10
## 11                 3     100      0.10             10
## 12                 5     100      0.10             10
set.seed(123)
HR_GBM_train <- train(Attrition~., 
                 data = HR_train,
                 method = 'gbm',
                 metric = 'ROC',
                 trControl=HR_GBM_control,
                 tuneGrid = HR_GBM_grid,
                 verbose=FALSE)

print(HR_GBM_train)
## Stochastic Gradient Boosting 
## 
## 1030 samples
##   31 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 824, 823, 824, 825, 824 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  ROC        Sens       Spec     
##   0.05       2                   50      0.7404624  0.9918941  0.1270945
##   0.05       2                  100      0.7588339  0.9861137  0.1812834
##   0.05       3                   50      0.7496150  0.9872631  0.1513369
##   0.05       3                  100      0.7575291  0.9768450  0.1992870
##   0.05       5                   50      0.7496994  0.9895819  0.1511586
##   0.05       5                  100      0.7629605  0.9803199  0.2051693
##   0.10       2                   50      0.7617151  0.9861003  0.1869875
##   0.10       2                  100      0.7743626  0.9722140  0.2475936
##   0.10       3                   50      0.7598030  0.9745396  0.1934046
##   0.10       3                  100      0.7758482  0.9641215  0.2475936
##   0.10       5                   50      0.7659259  0.9675965  0.2413547
##   0.10       5                  100      0.7713414  0.9687458  0.2955437
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
summary(HR_GBM_train)

##                                                               var    rel.inf
## MonthlyIncome                                       MonthlyIncome 12.7869559
## OverTimeYes                                           OverTimeYes 10.0230755
## Age                                                           Age  7.2209250
## DailyRate                                               DailyRate  7.2082878
## HourlyRate                                             HourlyRate  4.9710056
## MaritalStatusSingle                           MaritalStatusSingle  4.9515005
## DistanceFromHome                                 DistanceFromHome  4.5041845
## EmployeeNumber                                     EmployeeNumber  4.3631242
## BusinessTravelTravel_Frequently   BusinessTravelTravel_Frequently  3.5621742
## TotalWorkingYears                               TotalWorkingYears  3.4584936
## PercentSalaryHike                               PercentSalaryHike  3.0576645
## YearsWithCurrManager                         YearsWithCurrManager  2.6995844
## YearsAtCompany                                     YearsAtCompany  2.6825570
## YearsInCurrentRole                             YearsInCurrentRole  2.3647025
## WorkLifeBalance3                                 WorkLifeBalance3  1.9709473
## NumCompaniesWorked                             NumCompaniesWorked  1.9073885
## YearsSinceLastPromotion                   YearsSinceLastPromotion  1.9035499
## MonthlyRate                                           MonthlyRate  1.7543968
## JobRoleSales Executive                     JobRoleSales Executive  1.7015673
## EnvironmentSatisfaction4                 EnvironmentSatisfaction4  1.6908300
## JobSatisfaction4                                 JobSatisfaction4  1.4920081
## StockOptionLevel1                               StockOptionLevel1  1.4841186
## DepartmentResearch & Development DepartmentResearch & Development  1.2983130
## JobRoleResearch Scientist               JobRoleResearch Scientist  1.2192456
## RelationshipSatisfaction4               RelationshipSatisfaction4  0.9322413
## EducationFieldMarketing                   EducationFieldMarketing  0.8175701
## JobRoleLaboratory Technician         JobRoleLaboratory Technician  0.8074269
## EducationFieldTechnical Degree     EducationFieldTechnical Degree  0.6649801
## EducationFieldMedical                       EducationFieldMedical  0.6172724
## JobLevel3                                               JobLevel3  0.5573757
## GenderMale                                             GenderMale  0.5007384
## EnvironmentSatisfaction3                 EnvironmentSatisfaction3  0.4701174
## JobInvolvement3                                   JobInvolvement3  0.4306623
## WorkLifeBalance2                                 WorkLifeBalance2  0.4131238
## JobSatisfaction2                                 JobSatisfaction2  0.3842745
## RelationshipSatisfaction3               RelationshipSatisfaction3  0.3823399
## JobRoleSales Representative           JobRoleSales Representative  0.3621029
## JobLevel2                                               JobLevel2  0.3486336
## TrainingTimesLastYear2                     TrainingTimesLastYear2  0.2667205
## JobRoleHuman Resources                     JobRoleHuman Resources  0.2589413
## StockOptionLevel3                               StockOptionLevel3  0.2566517
## JobInvolvement2                                   JobInvolvement2  0.2425063
## Education3                                             Education3  0.2375658
## RelationshipSatisfaction2               RelationshipSatisfaction2  0.1873844
## StockOptionLevel2                               StockOptionLevel2  0.1761193
## JobInvolvement4                                   JobInvolvement4  0.1486218
## TrainingTimesLastYear5                     TrainingTimesLastYear5  0.1420719
## TrainingTimesLastYear6                     TrainingTimesLastYear6  0.1179573
## BusinessTravelTravel_Rarely           BusinessTravelTravel_Rarely  0.0000000
## DepartmentSales                                   DepartmentSales  0.0000000
## Education2                                             Education2  0.0000000
## Education4                                             Education4  0.0000000
## Education5                                             Education5  0.0000000
## EducationFieldLife Sciences           EducationFieldLife Sciences  0.0000000
## EducationFieldOther                           EducationFieldOther  0.0000000
## EnvironmentSatisfaction2                 EnvironmentSatisfaction2  0.0000000
## JobLevel4                                               JobLevel4  0.0000000
## JobLevel5                                               JobLevel5  0.0000000
## JobRoleManager                                     JobRoleManager  0.0000000
## JobRoleManufacturing Director       JobRoleManufacturing Director  0.0000000
## JobRoleResearch Director                 JobRoleResearch Director  0.0000000
## JobSatisfaction3                                 JobSatisfaction3  0.0000000
## MaritalStatusMarried                         MaritalStatusMarried  0.0000000
## PerformanceRating4                             PerformanceRating4  0.0000000
## TrainingTimesLastYear1                     TrainingTimesLastYear1  0.0000000
## TrainingTimesLastYear3                     TrainingTimesLastYear3  0.0000000
## TrainingTimesLastYear4                     TrainingTimesLastYear4  0.0000000
## WorkLifeBalance4                                 WorkLifeBalance4  0.0000000

5.6 Comparison

prediction_HR_PRA_test <- predict(HR_PRA, newx=HR_test_matrix, s=c("lambda.min"), type="class")
prediction_HR_PRA_train <-  predict(HR_PRA, newx=HR_train_matrix, s=c("lambda.min"), type="class")

cm_HR_PRA_test <- confusionMatrix(factor(prediction_HR_PRA_test), factor(HR_test$Attrition))
cm_HR_PRA_train <- confusionMatrix(factor(prediction_HR_PRA_train), factor(HR_train$Attrition))


prediction_HR_DT_test <- predict(HR_DT, HR_test, type="class")
prediction_HR_DT_train <- predict(HR_DT, HR_train, type="class")

cm_HR_DT_test <- confusionMatrix(factor(prediction_HR_DT_test), factor(HR_test$Attrition))
cm_HR_DT_train <- confusionMatrix(factor(prediction_HR_DT_train), factor(HR_train$Attrition))


prediction_HR_RF_test <- predict(HR_RF_train, HR_test, type="raw")
prediction_HR_RF_train <- predict(HR_RF_train, HR_train, type="raw")

cm_HR_RF_test <- confusionMatrix(factor(make.names(prediction_HR_RF_test)), factor(make.names(HR_test$Attrition)))
cm_HR_RF_train <- confusionMatrix(factor(make.names(prediction_HR_RF_train)), factor(make.names(HR_train$Attrition)))


prediction_HR_SGB_test <- predict(HR_GBM_train, HR_test, type="raw")
prediction_HR_SGB_train <- predict(HR_GBM_train, HR_train, type="raw")

cm_HR_SGB_test <- confusionMatrix(factor(make.names(prediction_HR_SGB_test)), factor(make.names(HR_test$Attrition)))
cm_HR_SGB_train <- confusionMatrix(factor(make.names(prediction_HR_SGB_train)), factor(make.names(HR_train$Attrition)))



HR_test_performances <- data.table(HR_PRA=cm_HR_PRA_test$overall["Accuracy"], HR_DT=cm_HR_DT_test$overall["Accuracy"], HR_RF=cm_HR_RF_test$overall["Accuracy"], HR_SGB=cm_HR_SGB_test$overall["Accuracy"])

HR_train_performances <- data.table(HR_PRA=cm_HR_PRA_train$overall["Accuracy"], HR_DT=cm_HR_DT_train$overall["Accuracy"], HR_RF=cm_HR_RF_train$overall["Accuracy"], HR_SGB=cm_HR_SGB_train$overall["Accuracy"])

HR_test_performances 
##       HR_PRA     HR_DT     HR_RF    HR_SGB
## 1: 0.8818182 0.8022727 0.8522727 0.8727273
HR_train_performances 
##       HR_PRA     HR_DT    HR_RF   HR_SGB
## 1: 0.8708738 0.9087379 0.961165 0.915534
  • The cross-validation error rates of the different approaches are consistent with the test error rates.
  • Overall, all methods performed well on this data, as there were not too many features.
  • The best-performing method was random forest on the training data and penalized regression on the test data.
  • Decision trees are known to be biased under class imbalance, and this data has roughly a 5:1 imbalance. That may be why penalized regression worked better; nonetheless, the tree-based methods performed well.
  • Monthly income was found important and used frequently for splits in the tree-based approaches, possibly due to a high-cardinality issue there. This may be another reason for the gap between train and test accuracy in the tree-based methods.
  • Since penalized regression worked so well on this dataset, the relationships in the data may be largely linear.
  • Among the tree-based methods, SGB performed the best.
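As a quick check of the class-imbalance point above, the training class counts can be inspected directly (a minimal sketch using the HR_train object defined earlier):

```r
# Count and proportion of each Attrition class in the training data;
# the ratio of "No" to "Yes" is roughly 5:1.
table(HR_train$Attrition)
prop.table(table(HR_train$Attrition))
```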

6. NEWS DATASET

6.1 Data Manipulation

The first column gives the URL of the article and is present only for descriptive purposes, so I drop it before modeling. Some other variables have no variation or are highly correlated: for example, the number of keywords is already in the dataset, so derived features such as its average, minimum, or maximum add little information. I removed these columns to clean the data.

news_data_drop <- select(news_data, -c("url", "timedelta", "n_non_stop_words", "kw_min_min", "kw_max_min", "kw_avg_min", "kw_min_max", "kw_max_max", "kw_avg_max", "kw_min_avg", "kw_max_avg", "kw_avg_avg", "min_positive_polarity", "max_positive_polarity", "min_negative_polarity", "max_negative_polarity"))

While exploring the dataset I noticed that all variables, even the binary ones, were stored as numeric, so I identified the variables with at most five unique values and converted them to factors.

news_data_factors <- news_data_drop %>% 
  select_if(~n_distinct(.) <= 5)

colnames(news_data_factors)
##  [1] "data_channel_is_lifestyle"     "data_channel_is_entertainment"
##  [3] "data_channel_is_bus"           "data_channel_is_socmed"       
##  [5] "data_channel_is_tech"          "data_channel_is_world"        
##  [7] "weekday_is_monday"             "weekday_is_tuesday"           
##  [9] "weekday_is_wednesday"          "weekday_is_thursday"          
## [11] "weekday_is_friday"             "weekday_is_saturday"          
## [13] "weekday_is_sunday"             "is_weekend"
news_data_fix <- news_data_drop %>% 
  mutate_at(c("data_channel_is_lifestyle", "data_channel_is_entertainment", "data_channel_is_bus",    "data_channel_is_socmed","data_channel_is_tech","data_channel_is_world","weekday_is_monday","weekday_is_tuesday","weekday_is_wednesday", "weekday_is_thursday","weekday_is_friday","weekday_is_saturday", "weekday_is_sunday", "is_weekend"), factor)
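The manually listed columns above could equivalently be derived from the same uniqueness rule programmatically; a sketch (the hypothetical news_data_fix2 would match news_data_fix):

```r
# Convert every column with at most five distinct values to a factor,
# reproducing the manual column list above.
news_data_fix2 <- news_data_drop %>%
  mutate_if(~ n_distinct(.) <= 5, factor)
```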

Checking the data with skim() and str() shows that there are no missing values; after converting the dummy encodings from numeric to factor, the data are clean.

median_shares<-median(news_data$shares)

news_data_binary <- news_data_fix %>%
  mutate(is_popular=as.numeric(shares> median_shares))

news_data_binary2 <- news_data_binary %>% 
  mutate_at(c("is_popular"), factor)

str(news_data_binary2)
## Classes 'data.table' and 'data.frame':   39644 obs. of  46 variables:
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ data_channel_is_entertainment: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
##  $ data_channel_is_bus          : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
##  $ data_channel_is_socmed       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ data_channel_is_tech         : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
##  $ data_channel_is_world        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weekday_is_tuesday           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_wednesday         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_thursday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_friday            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_saturday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_sunday            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ is_weekend                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...
##  $ is_popular                   : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
skim(news_data_binary2)
Data summary
Name news_data_binary2
Number of rows 39644
Number of columns 46
_______________________
Column type frequency:
factor 15
numeric 31
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
data_channel_is_lifestyle 0 1 FALSE 2 0: 37545, 1: 2099
data_channel_is_entertainment 0 1 FALSE 2 0: 32587, 1: 7057
data_channel_is_bus 0 1 FALSE 2 0: 33386, 1: 6258
data_channel_is_socmed 0 1 FALSE 2 0: 37321, 1: 2323
data_channel_is_tech 0 1 FALSE 2 0: 32298, 1: 7346
data_channel_is_world 0 1 FALSE 2 0: 31217, 1: 8427
weekday_is_monday 0 1 FALSE 2 0: 32983, 1: 6661
weekday_is_tuesday 0 1 FALSE 2 0: 32254, 1: 7390
weekday_is_wednesday 0 1 FALSE 2 0: 32209, 1: 7435
weekday_is_thursday 0 1 FALSE 2 0: 32377, 1: 7267
weekday_is_friday 0 1 FALSE 2 0: 33943, 1: 5701
weekday_is_saturday 0 1 FALSE 2 0: 37191, 1: 2453
weekday_is_sunday 0 1 FALSE 2 0: 36907, 1: 2737
is_weekend 0 1 FALSE 2 0: 34454, 1: 5190
is_popular 0 1 FALSE 2 0: 20082, 1: 19562

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
n_tokens_title 0 1 10.40 2.11 2.00 9.00 10.00 12.00 23.00 ▁▇▇▁▁
n_tokens_content 0 1 546.51 471.11 0.00 246.00 409.00 716.00 8474.00 ▇▁▁▁▁
n_unique_tokens 0 1 0.55 3.52 0.00 0.47 0.54 0.61 701.00 ▇▁▁▁▁
n_non_stop_unique_tokens 0 1 0.69 3.26 0.00 0.63 0.69 0.75 650.00 ▇▁▁▁▁
num_hrefs 0 1 10.88 11.33 0.00 4.00 8.00 14.00 304.00 ▇▁▁▁▁
num_self_hrefs 0 1 3.29 3.86 0.00 1.00 3.00 4.00 116.00 ▇▁▁▁▁
num_imgs 0 1 4.54 8.31 0.00 1.00 1.00 4.00 128.00 ▇▁▁▁▁
num_videos 0 1 1.25 4.11 0.00 0.00 0.00 1.00 91.00 ▇▁▁▁▁
average_token_length 0 1 4.55 0.84 0.00 4.48 4.66 4.85 8.04 ▁▁▇▃▁
num_keywords 0 1 7.22 1.91 1.00 6.00 7.00 9.00 10.00 ▁▂▇▇▇
self_reference_min_shares 0 1 3998.76 19738.67 0.00 639.00 1200.00 2600.00 843300.00 ▇▁▁▁▁
self_reference_max_shares 0 1 10329.21 41027.58 0.00 1100.00 2800.00 8000.00 843300.00 ▇▁▁▁▁
self_reference_avg_sharess 0 1 6401.70 24211.33 0.00 981.19 2200.00 5200.00 843300.00 ▇▁▁▁▁
LDA_00 0 1 0.18 0.26 0.00 0.03 0.03 0.24 0.93 ▇▁▁▁▁
LDA_01 0 1 0.14 0.22 0.00 0.03 0.03 0.15 0.93 ▇▁▁▁▁
LDA_02 0 1 0.22 0.28 0.00 0.03 0.04 0.33 0.92 ▇▁▁▁▁
LDA_03 0 1 0.22 0.30 0.00 0.03 0.04 0.38 0.93 ▇▁▁▁▂
LDA_04 0 1 0.23 0.29 0.00 0.03 0.04 0.40 0.93 ▇▂▁▁▂
global_subjectivity 0 1 0.44 0.12 0.00 0.40 0.45 0.51 1.00 ▁▃▇▁▁
global_sentiment_polarity 0 1 0.12 0.10 -0.39 0.06 0.12 0.18 0.73 ▁▂▇▁▁
global_rate_positive_words 0 1 0.04 0.02 0.00 0.03 0.04 0.05 0.16 ▅▇▁▁▁
global_rate_negative_words 0 1 0.02 0.01 0.00 0.01 0.02 0.02 0.18 ▇▁▁▁▁
rate_positive_words 0 1 0.68 0.19 0.00 0.60 0.71 0.80 1.00 ▁▁▃▇▃
rate_negative_words 0 1 0.29 0.16 0.00 0.19 0.28 0.38 1.00 ▅▇▃▁▁
avg_positive_polarity 0 1 0.35 0.10 0.00 0.31 0.36 0.41 1.00 ▁▇▃▁▁
avg_negative_polarity 0 1 -0.26 0.13 -1.00 -0.33 -0.25 -0.19 0.00 ▁▁▂▇▃
title_subjectivity 0 1 0.28 0.32 0.00 0.00 0.15 0.50 1.00 ▇▂▂▁▂
title_sentiment_polarity 0 1 0.07 0.27 -1.00 0.00 0.00 0.15 1.00 ▁▁▇▂▁
abs_title_subjectivity 0 1 0.34 0.19 0.00 0.17 0.50 0.50 0.50 ▃▂▁▁▇
abs_title_sentiment_polarity 0 1 0.16 0.23 0.00 0.00 0.00 0.25 1.00 ▇▂▁▁▁
shares 0 1 3395.38 11626.95 1.00 946.00 1400.00 2800.00 843300.00 ▇▁▁▁▁

In the original dataset the target (shares) is continuous, as the histogram shows. I split it at its median and modeled the task as binary classification, because the many binary dummy encodings and the large number of features made regression perform poorly. As the bar plot shows, the resulting classes are balanced.

hist(news_data$shares, breaks=2000, xlim=c(0,15000))

ggplot(news_data_binary2, aes(is_popular)) + geom_bar()

There are over 40,000 observations in the news dataset, which made model building too slow, so I sampled 4,000 observations. After sampling, I set aside 70% of the data for training and 30% for testing.

news_data_omit <- na.omit(news_data_binary2)

news_data_omit2 <- select(news_data_omit, -c("shares"))

# seed the RNG before sampling so the subsample is reproducible
set.seed(123)
news_data_omit2 = news_data_omit2[sample(nrow(news_data_omit2), 4000), ]

set.seed(123)
split_news <- createDataPartition(news_data_omit2$`is_popular`, p = .7, 
                                     list = FALSE, 
                                     times = 1)

news_train <- news_data_omit2[split_news,]
news_test <- news_data_omit2[-split_news,]

6.2 Penalized Regression Approaches

As with the HR data, the penalized regression approach requires standardized inputs, but the data contain binary variables, so I standardized only the numeric columns of the training and test sets. The tree-based approaches do not require the standardized versions.

news_train_standard <- news_train %>% 
    mutate_if(is.numeric, scale)

news_test_standard <- news_test %>% 
    mutate_if(is.numeric, scale)

news_target_train <- data.matrix(news_train_standard$is_popular)
news_target_test <- data.matrix(news_test_standard$is_popular)

news_train_matrix <-  data.matrix(news_train_standard[,-c("is_popular")])
news_test_matrix <-  data.matrix(news_test_standard[,-c("is_popular")])



news_PRA_cv = cv.glmnet(news_train_matrix, news_target_train, family="binomial")
options(repr.plot.width=5, repr.plot.height=5)

plot(news_PRA_cv)

news_lambda_min = news_PRA_cv$lambda.min
news_lambda_min
## [1] 0.006113813
news_lambda_1se = news_PRA_cv$lambda.1se
news_lambda_1se
## [1] 0.01550074
news_PRA <- glmnet(news_train_matrix, news_target_train ,family="binomial", alpha=1,lambda=news_PRA_cv$lambda.min)
summary(news_PRA)
##            Length Class     Mode     
## a0          1     -none-    numeric  
## beta       44     dgCMatrix S4       
## df          1     -none-    numeric  
## dim         2     -none-    numeric  
## lambda      1     -none-    numeric  
## dev.ratio   1     -none-    numeric  
## nulldev     1     -none-    numeric  
## npasses     1     -none-    numeric  
## jerr        1     -none-    numeric  
## offset      1     -none-    logical  
## classnames  2     -none-    character
## call        6     -none-    call     
## nobs        1     -none-    numeric

Now we can calculate accuracy for the train and test data from the predicted class labels; since this is a classification task, accuracy is the share of correct predictions rather than a sum of squared errors.
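That calculation can be sketched the same way as in the HR comparison (assuming the news_PRA model and the matrices defined above):

```r
# Predict class labels on the test matrix and compute accuracy
# from the confusion matrix, as done for the HR models.
prediction_news_PRA_test <- predict(news_PRA, newx = news_test_matrix, type = "class")
cm_news_PRA_test <- confusionMatrix(factor(prediction_news_PRA_test),
                                    factor(news_target_test))
cm_news_PRA_test$overall["Accuracy"]
```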

6.3 Decision Trees

news_tune_grid = expand.grid(min_leaf_obs=seq(3,10,1),complexity=seq(0,0.3,0.1)) 


news_folds <- createFolds(1:nrow(news_train), k = 10)
news_cv_data = news_train

news_cv_param = tibble()
news_all_cv_stat = data.table()

for(p in 1:nrow(news_tune_grid)) {
  news_temp_param = news_tune_grid[p,]
  news_temp_result = tibble()
  for(i in 1:10){
    news_temp_test = news_cv_data[news_folds[[i]],]
    news_temp_train = news_cv_data[-news_folds[[i]],]
    news_temp_fit = rpart(is_popular~., data = news_temp_train, method = "class",
                          control = rpart.control(minbucket = news_temp_param$min_leaf_obs,
                                                  cp = news_temp_param$complexity))
    news_temp_test$Prediction = predict(news_temp_fit, news_temp_test, type = "class")
    news_temp_result = rbind(news_temp_result, news_temp_test)
  }
  # accuracy is computed over the pooled held-out predictions from all folds
  news_temp_stat = data.table(news_temp_param,
                              Accuracy = sum(news_temp_result$Prediction == news_temp_result$is_popular) / nrow(news_temp_result))
  print(news_temp_stat)
  news_all_cv_stat = rbind(news_all_cv_stat, news_temp_stat)
}
##    min_leaf_obs complexity  Accuracy
## 1:            3          0 0.5785714
##    min_leaf_obs complexity  Accuracy
## 1:            4          0 0.5857143
##    min_leaf_obs complexity  Accuracy
## 1:            5          0 0.5928571
##    min_leaf_obs complexity Accuracy
## 1:            6          0    0.575
##    min_leaf_obs complexity  Accuracy
## 1:            7          0 0.5428571
##    min_leaf_obs complexity  Accuracy
## 1:            8          0 0.5821429
##    min_leaf_obs complexity Accuracy
## 1:            9          0    0.575
##    min_leaf_obs complexity  Accuracy
## 1:           10          0 0.5642857
##    min_leaf_obs complexity  Accuracy
## 1:            3        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            4        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.1 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            3        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            4        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.2 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            3        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            4        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            5        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            6        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            7        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            8        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:            9        0.3 0.4821429
##    min_leaf_obs complexity  Accuracy
## 1:           10        0.3 0.4821429
news_best_DT = news_all_cv_stat %>% arrange(-Accuracy) %>% head(1)
news_best_DT
##    min_leaf_obs complexity  Accuracy
## 1:            5          0 0.5928571

Then I use the best-performing parameters to build the CART tree.

news_DT = rpart(is_popular~.,data = news_train, method="class",control = rpart.control(minbucket = news_best_DT$min_leaf_obs,cp=news_best_DT$complexity))

fancyRpartPlot(news_DT)

6.4 Random Forests

For RF we tune only m (mtry); the other parameters are fixed at J = 500 trees and a minimum of 5 observations per terminal node.

The square root of the number of features is the usual default for mtry. As there are around 50 variables, I cross-validate over 7, 8 and 9.
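The square-root heuristic behind those candidate values can be sketched as follows (the predictor count `p` is an approximation, matching the "around 50 variables" above):

```r
# mtry heuristic: square root of the number of predictors
p <- 50              # approximate number of predictors in the news data
sqrt(p)              # ~7.07
floor(sqrt(p))       # 7, so the grid 7, 8, 9 brackets the heuristic value
```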

set.seed(123)
news_m=c(7,8,9)
J=500 
min_size_of_terminal_nodes=5
n_folds=10

news_RF_fitControl=trainControl(method = "repeatedcv",
                           number = n_folds,
                           repeats = 1, classProbs = TRUE, search="random")  

news_RF_grid=expand.grid(mtry = news_m)

news_RF_train <- train(make.names(as.factor(is_popular))~., data = news_train, method = "rf", trControl = news_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = news_RF_grid)

print(news_RF_train[["results"]])
##   mtry  Accuracy     Kappa AccuracySD    KappaSD
## 1    7 0.6226398 0.2452905 0.02918176 0.05835767
## 2    8 0.6187112 0.2374175 0.02616970 0.05234155
## 3    9 0.6183516 0.2367177 0.02531512 0.05062853

6.5 Stochastic Gradient Boosting

set.seed(123)

news_GBM_control <- trainControl(method='cv',  number=5,  search='grid', summaryFunction = twoClassSummary, classProbs = T)


news_GBM_grid <-  expand.grid(interaction.depth = c(2, 3, 5),  n.trees = c(50,100),  shrinkage = c(0.05,0.1), n.minobsinnode = 10)



print(news_GBM_grid)
##    interaction.depth n.trees shrinkage n.minobsinnode
## 1                  2      50      0.05             10
## 2                  3      50      0.05             10
## 3                  5      50      0.05             10
## 4                  2     100      0.05             10
## 5                  3     100      0.05             10
## 6                  5     100      0.05             10
## 7                  2      50      0.10             10
## 8                  3      50      0.10             10
## 9                  5      50      0.10             10
## 10                 2     100      0.10             10
## 11                 3     100      0.10             10
## 12                 5     100      0.10             10
set.seed(123)
news_GBM_train <- train(make.names(as.factor(is_popular))~., 
                 data = news_train,
                 method = 'gbm',
                 metric = 'ROC',
                 trControl=news_GBM_control,
                 tuneGrid = news_GBM_grid,
                 verbose=FALSE)

print(news_GBM_train)
## Stochastic Gradient Boosting 
## 
## 2801 samples
##   44 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2241, 2241, 2241, 2241, 2240 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  ROC        Sens       Spec     
##   0.05       2                   50      0.6740543  0.6171429  0.6224021
##   0.05       2                  100      0.6765625  0.6221429  0.6245653
##   0.05       3                   50      0.6778436  0.6014286  0.6323894
##   0.05       3                  100      0.6794994  0.6142857  0.6331291
##   0.05       5                   50      0.6782237  0.6142857  0.6295348
##   0.05       5                  100      0.6827042  0.6092857  0.6252745
##   0.10       2                   50      0.6808680  0.6285714  0.6259736
##   0.10       2                  100      0.6798299  0.6235714  0.6231342
##   0.10       3                   50      0.6765526  0.6221429  0.6188612
##   0.10       3                  100      0.6746729  0.6278571  0.6088409
##   0.10       5                   50      0.6710527  0.6100000  0.6117184
##   0.10       5                  100      0.6681641  0.6192857  0.6088612
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  5, shrinkage = 0.05 and n.minobsinnode = 10.
summary(news_GBM_train)

##                                                           var   rel.inf
## self_reference_min_shares           self_reference_min_shares 8.3351084
## is_weekend1                                       is_weekend1 7.7612017
## LDA_02                                                 LDA_02 7.1272192
## LDA_01                                                 LDA_01 5.0884619
## average_token_length                     average_token_length 4.9510496
## num_hrefs                                           num_hrefs 4.9038212
## global_sentiment_polarity           global_sentiment_polarity 4.4833163
## LDA_00                                                 LDA_00 4.0990566
## data_channel_is_entertainment1 data_channel_is_entertainment1 4.0989450
## n_non_stop_unique_tokens             n_non_stop_unique_tokens 4.0747828
## self_reference_avg_sharess         self_reference_avg_sharess 3.8403526
## global_rate_positive_words         global_rate_positive_words 3.6384221
## data_channel_is_world1                 data_channel_is_world1 3.0886016
## global_subjectivity                       global_subjectivity 3.0449110
## data_channel_is_socmed1               data_channel_is_socmed1 2.9876292
## n_unique_tokens                               n_unique_tokens 2.5487488
## rate_negative_words                       rate_negative_words 2.2435258
## LDA_03                                                 LDA_03 2.2270708
## avg_negative_polarity                   avg_negative_polarity 2.1838811
## num_imgs                                             num_imgs 2.1811247
## n_tokens_content                             n_tokens_content 2.1016394
## title_sentiment_polarity             title_sentiment_polarity 1.8170108
## self_reference_max_shares           self_reference_max_shares 1.6757440
## abs_title_subjectivity                 abs_title_subjectivity 1.5563534
## LDA_04                                                 LDA_04 1.5230991
## rate_positive_words                       rate_positive_words 1.2361337
## avg_positive_polarity                   avg_positive_polarity 1.2054783
## global_rate_negative_words         global_rate_negative_words 1.1606496
## n_tokens_title                                 n_tokens_title 1.1451287
## title_subjectivity                         title_subjectivity 0.7022705
## num_keywords                                     num_keywords 0.6062751
## abs_title_sentiment_polarity     abs_title_sentiment_polarity 0.5167960
## num_videos                                         num_videos 0.4054380
## num_self_hrefs                                 num_self_hrefs 0.3947495
## weekday_is_friday1                         weekday_is_friday1 0.2863353
## weekday_is_sunday1                         weekday_is_sunday1 0.1890140
## weekday_is_tuesday1                       weekday_is_tuesday1 0.1665105
## weekday_is_wednesday1                   weekday_is_wednesday1 0.1487118
## data_channel_is_tech1                   data_channel_is_tech1 0.1280160
## weekday_is_thursday1                     weekday_is_thursday1 0.1274161
## data_channel_is_lifestyle1         data_channel_is_lifestyle1 0.0000000
## data_channel_is_bus1                     data_channel_is_bus1 0.0000000
## weekday_is_monday1                         weekday_is_monday1 0.0000000
## weekday_is_saturday1                     weekday_is_saturday1 0.0000000

6.6 Comparison

prediction_news_PRA_test <- predict(news_PRA, newx=news_test_matrix, s=c("lambda.min"), type="class")
prediction_news_PRA_train <-  predict(news_PRA, newx=news_train_matrix, s=c("lambda.min"), type="class")

cm_news_PRA_test <- confusionMatrix(factor(prediction_news_PRA_test), factor(news_test$is_popular))
cm_news_PRA_train <- confusionMatrix(factor(prediction_news_PRA_train), factor(news_train$is_popular))


prediction_news_DT_test <- predict(news_DT, news_test, type="class")
prediction_news_DT_train <- predict(news_DT, news_train, type="class")

cm_news_DT_test <- confusionMatrix(factor(prediction_news_DT_test), factor(news_test$is_popular))
cm_news_DT_train <- confusionMatrix(factor(prediction_news_DT_train), factor(news_train$is_popular))


prediction_news_RF_test <- predict(news_RF_train, news_test, type="raw")
prediction_news_RF_train <- predict(news_RF_train, news_train, type="raw")

cm_news_RF_test <- confusionMatrix(factor(prediction_news_RF_test), factor(make.names(news_test$is_popular)))
cm_news_RF_train <- confusionMatrix(factor(prediction_news_RF_train), factor(make.names(news_train$is_popular)))


prediction_news_SGB_test <- predict(news_GBM_train, news_test, type="raw")
prediction_news_SGB_train <- predict(news_GBM_train, news_train, type="raw")

cm_news_SGB_test <- confusionMatrix(factor(prediction_news_SGB_test), factor(make.names(news_test$is_popular)))
cm_news_SGB_train <- confusionMatrix(factor(prediction_news_SGB_train), factor(make.names(news_train$is_popular)))



news_test_performances <- data.table(news_PRA_test=cm_news_PRA_test$overall["Accuracy"], news_DT_test=cm_news_DT_test$overall["Accuracy"], news_RF_test=cm_news_RF_test$overall["Accuracy"], news_SGB_test=cm_news_SGB_test$overall["Accuracy"])

news_train_performances <- data.table(news_PRA_train=cm_news_PRA_train$overall["Accuracy"], news_DT_train=cm_news_DT_train$overall["Accuracy"], news_RF_train=cm_news_RF_train$overall["Accuracy"], news_SGB_train=cm_news_SGB_train$overall["Accuracy"])

news_test_performances
##    news_PRA_test news_DT_test news_RF_test news_SGB_test
## 1:      0.616347    0.5629691    0.6422018       0.64804
news_train_performances
##    news_PRA_train news_DT_train news_RF_train news_SGB_train
## 1:      0.6301321     0.8543377      0.998929      0.7065334
  • Accuracy values are generally lower for this dataset. The target was originally continuous, but the regression accuracy was so poor that I binarized it by splitting at the median, turning the task into a binary classification problem.
  • SGB performed best: its test accuracy is the highest, and its train and test accuracies are close to each other.
  • For random forest, accuracy on the train data is almost perfect but much lower on the test data; the random forest overfitted the training data.
  • The cross-validation error rates are not really consistent with the test error rates for RF and CART.
  • For the decision tree as well, the large gap between train and test accuracy points to overfitting.
  • Penalized regression did not work well; the relationships may be highly nonlinear.
  • Since there was no class imbalance in this data, I expected the tree-based methods to work much better than penalized regression, but the accuracy values are not that different. This may be due to the high number of binary variables in the dataset, which can cause cardinality-related problems.
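The median split used to binarize the target can be sketched as follows (the `shares` vector is a hypothetical stand-in for the original continuous target):

```r
# hypothetical continuous target values (e.g. article share counts)
shares <- c(120, 450, 980, 300, 75, 2200, 640, 510)

# binarize at the median: 1 = popular (above the median), 0 = not popular
is_popular <- as.integer(shares > median(shares))
table(is_popular)  # 4 zeros, 4 ones: a balanced binary target by construction
```

A median split guarantees (up to ties) a balanced binary target, which is consistent with the absence of class imbalance noted above.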

7 CONCLUSION

According to the results for the 4 datasets, my interpretations are as follows:

  • Test and cross-validation results are close to each other in most cases, except for a few overfitting cases.
  • Random Forest and CART tend to overfit the training data.
  • With large feature sets, SGB performed better than the other tree-based methods.
  • When there is class imbalance, the tree-based approaches showed bias, and penalized regression worked better.
  • I expected the penalized regression approach to work well for regression, since extrapolation can be a problem for tree-based approaches. But the data was nonlinear, and penalized regression works better when the relationships are not highly nonlinear.
  • CART tends to overfit even more in the regression setting.